PROCEEDINGS

DONALD B. RUBIN, first having been duly sworn, testified as follows in answer to direct examination by MR. WITHEY:

Q. Would you state your name and give your business address for the record, please?

A. Donald B. Rubin. My primary affiliation is the Department of Statistics at Harvard University.

Q. I need the address.

A. One Oxford Street, Cambridge, Mass.

Q. Dr. Rubin, you have submitted, I guess it has been two reports, I think, in this case. Is that your understanding? Or is it three?

A. I believe it is three.

Q. Okay. And we have -- one was on -- did you happen to bring those with you?

A. I have copies somewhere with me, but not on the desk here, the table here.
September 11, 1998; February 23, 1999; and May 26, 1999?

A. That sounds right.

Q. And you understood that the purpose of submitting these reports was to provide us with your opinions and the basis of your opinions in this case? Correct?

Q. And you understood that there was some deadlines that were required in submitting these reports?

A. Correct.

Q. And you understood that in submitting the work that you had done in preparing this report that they -- that that work was to have been completed, with the exception of preparing for testimony, by the times the reports were to be completed? Correct?

A. If I understand the question, yes.

Q. All right.

A. That there were things that had deadlines, and I was -- I was to make those deadlines in the reports.


Q. Right. And as I understand it, one of the areas of work that you have undertaken is to conduct what might for better term be called a multiply or multiple imputation analysis of data in the National Medical Expenditure Survey?

A. Basically that is right, although it is not a multiple imputation analysis per se. It is creating a multiply-imputed version


of NMES that can support valid 




Q. You have been -- you were testifying --

MR. WITHEY: Strike that.

Q. You were deposed in both this case as well as the Ohio ironworkers' litigation by myself? Correct?

A. Correct.

Q. And the last time we met for deposition, I believe it was sometime in January in the ironworkers' case, you testified about the process by which the multiple imputation work was being undertaken? Correct?

A. Yes. We discussed it I think relatively briefly, but the issue certainly arose.

Q. All right. I am not going to ask you -- the reason I wanted to make sure we understand each other is I am not going to go back and ask all the same questions, because you testified then, I assume honestly, about your work in performing the analysis?

A. Yes. I did.

Q. And you understood that in that case, the -- case again, there were deadlines that were required to be fulfilled in order to submit your work to us, you know, in that case? Correct?

A. Correct.

Q. And it was your intention, was it not, to have completed the multiple imputation -- is that the word you would be comfortable with?

Q. Multiple imputation of NMES --
A. Yes. 

Q. -- by the time of your deposition in 

January in the ironworkers' case; correct? 
A. I was hoping to have that done. Correct. 
the other major multiple imputation projects in which I have been involved.

Q. Okay. I appreciate the general description. Now I would like to specify -- be a little more specific. First of all, what is the name of the person at the University of Michigan that has been working on this project?

A. His name is Dr. Raghunathan.

Q. Could you spell that?

A. R A G H U N A T H A N.

Q. Two names?

A. No. One name.

Q. Okay. And when was he first contacted to perform this work?

A. Let me see. Do I really know? It was before the January deposition that you are referring to. Maybe a -- maybe a month before. Maybe a little before that. I am uncertain. I could figure it out if I went back and looked at records, but -- but that's the closest I can come just sitting here right now.

Q. What records are you referring to that you 
have of work performed on this project? 
A. Well, I would have some records of phone calls to him; days and hours that I spent working on the project.

Q. Any correspondence?

A. Written correspondence or --

Q. E-mails?

A. I did do some e-mailing. I don't think it has been saved, because most of the discussions took place by telephone. And I think the kind of e-mails which were sent were, "Have you done that yet," sort of e-mails that don't convey any information. It would just be a message he could come and pick up without having to return a phone message.

Q. How about data exchanged? Is that reflected in any document or electronic mail?

A. Well, there were some intermediate sort of diagnostic outputs that were looked at along the way, but none of that was saved.

Q. Was there any other category of records,
either electronic or actual hard copy, 
that would reflect the work that you have 
done along with the work of the doctor in 
Michigan and William Wecker Associates? 

A. I don't believe so. 

Q. Okay. And where are those records located?

A. You mean the nonexistent ones? I don't --

Q. I thought you made records of phone calls.

A. Phone calls? Oh. There are -- Raghu, his name is Raghunathan -- collected records of his hours, which he would send to me, and I have records of my own hours that I had spent on it.

Q. And where are they located?

A. They are on a calendar -- mine are on a calendar, and his presumably are on a calendar that he keeps.

Q. They would be a record of the time spent and the day on which they were spent?

A. Yes.

Q. Could you please collect all of those and provide those to Mr. Biersteker, so that I can have them marked as Exhibit 1 for this deposition? Would you be willing to do that?

A. Okay. I could. I mean I can go back and try to -- and try to do that.

Q. Thank you.

A. Okay.

MR. BIERSTEKER: We will take that request under advisement.

MR. WITHEY: Fine. Let me know if you have decided not to do that, Peter.

MR. BIERSTEKER: I will let you know.

MR. WITHEY: So we can raise it as soon as possible.

MR. BIERSTEKER: All right.

MR. WITHEY: In fact, just for the record, our motions in limine I think are due on July 29th? Is that right, John?

MR. PHILLIPS: 29th.

MR. WITHEY: July 29th. Thank you. So we would want to resolve any dispute about our discovery of those records by a week before then so that we

G & M COURT REPORTERS, LTD.
(617) 338-0030 

http://legacy.library.ucsf.e®ffittiol/eDla|0^a£MSWpclf.industrydocuments.ucsf.edu/docs/hygl0001 



could obtain them and attach them, if 
necessary. 

MR. BIERSTEKER: All right. 

BY MR. WITHEY: 

Q. Doctor, could you describe the work that Raghu did for you on this project?

A. Sure. It will have to be slightly technical. I will try to make it as nontechnical as possible.

Q. Thank you.

A. The multiple imputation of NMES was actually divided into -- I divided it into three phases. And there is a first phase, which is the computationally far more expensive and complicated phase, and that is the phase that I tried to -- I have been working with Raghunathan, and the second and third phase are the phases that were primarily allocated to -- I allocated to the Wecker Associates.

The first phase was -- took the 
basic NMES data and all of the variables, 
we call them covariates, which are the 
nonexpense data, and made a summary for 
each person of the expense data, which arrived in a form which we -- created in the form of a table for each person of expense, expenses for each type and ICD 9 code, and to take that data set, which of course generated approximately 500 variables for each person, and tried to impute that. What multiply imputes means is that you just do the same thing repeatedly using random draws to represent the uncertainty about what the value to impute is.

The procedure being used is or was -- I'm sorry. Let me start that sentence again. The fundamental idea of this kind of multiple imputation is based on ideas that are sometimes called Markov Chain Monte Carlo methods. They are a way of simulating very complex distributions.

The ideas go back more than half a century but have become quite popular in statistics and related research areas in the last decade as computational equipment has made it possible.

It is a way of multiply -- it is a way of drawing simulated values in an iterative way. What I mean by that is you don't do it in one pass. You have to keep cycling around until the procedure stabilizes. Even one run on relatively sophisticated computing equipment today can take a day or two or three, which means that the process of debugging computer programs can be -- take a very long time.

Because NMES has many variables that we want to impute, all these detailed expenditure information by ICD 9 code, for example, this turns out to be a very demanding process.

Q. Have you completed your answer?

A. If -- if that answers adequate for your purposes, then I have. If you want me to expand upon it, I can.
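
[Editorial illustration, not part of the testimony.] The iterative, draw-based procedure described in the answer above can be pictured with a minimal sketch. This is not the program used for NMES, which the transcript does not show. It is a toy data-augmentation sampler, one simple kind of Markov Chain Monte Carlo scheme, for a single normally distributed variable with missing values. The function name `multiply_impute`, the normal model, the noninformative prior, and the iteration counts are all assumptions made for illustration.

```python
import math
import random


def multiply_impute(y, n_imputations=5, n_iter=200, seed=0):
    """Toy data augmentation for one normal variable; None marks a missing value.

    Hypothetical illustration only: the model, prior, and settings are assumed,
    not taken from the NMES work described in the testimony.
    """
    rng = random.Random(seed)
    missing = [i for i, v in enumerate(y) if v is None]
    observed = [v for v in y if v is not None]
    n = len(y)

    # Crude starting values: fill every gap with the observed mean.
    completed = list(y)
    start = sum(observed) / len(observed)
    for i in missing:
        completed[i] = start

    datasets = []
    for _ in range(n_imputations):
        for _ in range(n_iter):  # keep cycling; one pass is not enough
            # Draw parameters given the currently completed data.
            ybar = sum(completed) / n
            ss = sum((v - ybar) ** 2 for v in completed)
            chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)  # chi-square, n-1 df
            sigma2 = ss / chi2
            mu = rng.gauss(ybar, math.sqrt(sigma2 / n))
            # Draw each missing value given those parameters.
            for i in missing:
                completed[i] = rng.gauss(mu, math.sqrt(sigma2))
        datasets.append(list(completed))  # one completed data set per imputation
    return datasets
```

Each returned list is one completed data set. The observed entries are identical across the lists, while the imputed entries differ from one list to the next, which is how the repeated random draws represent uncertainty about the missing values.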

Q. Well, I thought you were going to describe which -- the three phases. I'm not sure what you just described fit into -- was that the first phase, or is that all three?

A. That was the first phase.

Q. Okay.

A. I believe your question was what I asked Raghunathan to do. That is what I asked him to do is the first phase. That is all I have described.

Q. So you did how many different computer runs that took one to three days did you run?

A. In trying to debug the thing, it must have been dozens. Maybe -- starting and figuring out problems and mistakes and rerunning things, maybe more than dozens.

Q. So each time that -- was it Raghu that did the runs then as opposed to yourself?

A. Raghunathan did all of the actual computer runs.

Q. What kind of computer does he have?

A. He has, I think, a couple of them of the very powerful new PCs. Pentium. I don't know which ones.
Q. When did he start work on phase one?

A. I believe back before the January deposition which you referred to.

Q. So this would have been in December of 1998?

A. I am guessing, but again I could try to reconstruct that with a phone call or --

Q. And prior to that time, you had worked with Dr. Schafer at Penn. State? Correct?

MR. BIERSTEKER: I object to the form of the question.

A. Let me clarify that. I have worked with Professor Schafer on other multiple imputation projects and on other papers, where he was my Ph.D. student at Harvard. I never worked with him for the multiple imputation of NMES at all. I never worked with him on anything with regard to this particular case.

Q. Did you ask him to do that?

A. No. I did not. I gave his name to Peter Biersteker to contact, because Joe Schafer and I had worked on two major multiple



imputation projects together for the 
federal government, whose surveys are like 
this one. 

We did NHANES, N H A N E S, and FARS, F A R S, together several years before, and so since I was very busy, I thought that these were projects that he might be -- that these were projects -- this project might be one that he would like to take on by himself, since he had quite a bit of experience working on such projects.

Q. To your knowledge, did he agree to work on that project?

A. That is my understanding.

Q. And did he generate any output from that project that you are aware of?

A. Not that I'm aware of. I don't think he did. I did speak to him about it towards the end of his involvement, and he was just completely tied up with other things.
He was in the process of his last year 
before potential tenure promotion, and he 
was very much involved in trying to get 
papers published and accepted in journals to secure his future at Penn. State.

Q. Was it after that time when it became known that Dr. Schafer could not complete work on this project that you then asked Dr. Raghu -- how do you pronounce it?

A. Raghunathan.

Q. Raghunathan to work with you on doing

A. I -- I asked him, and I -- I spoke to him. At that time I decided I would try to take charge of the project and try to figure out what resources were needed to actually do the multiple imputation and get into understanding what NMES was really -- the features of NMES that made it a difficult problem and how to decide a multiple imputation scheme that would work.

What were the features of NMES that made 
it a difficult problem? 

Well, one is the richness of the data. 
Generally there are lots of variables. In 
particular, the fact that there are a lot 
of expense variables, medical expense variables that are critical for this litigation, that have to be considered at a fairly detailed level; in other words, not just total medical expense, but medical expense broken down by ICD 9 codes, because the Surgeon General says these certain ICD 9 codes have different relative risks associated with them and, therefore, carving them up in any kind of calculation for smoking-attributable expenditures you have to be able to carve out different amounts for different ICD 9 codes, and, therefore, you can't just deal with total medical expenditures. You have to deal with them at the level of ICD 9 codes.



So that means that you create 
many variables, many more variables than 


you would if you were just concerned with, 
for example, total expense. 

Any other? Was there significant missing 
data? 

Yes. There was significant missing data 
but not unheard of by any means by federal database standards. It was the significant missing data in the sense of far too significant to handle by ad hoc methods.

Q. Any other attributables of the NMES data set that you claim made your work more complicated or complex or demanding?

A. Well, also there was the desire to handle missing information, missing data, on smoking characteristics, characteristics of smoking habits for people, because it was such an important variable for this case and the fact that the way the questions were asked in NMES created a relatively complex structure on variables related to smoking.

How many of the variables related to 
smoking did you analyze? 

It is hard to answer that directly, 
because they -- because the data have kind 
of a nested structure. As I remember, 
that people were first asked, for example, 
whether they were -- had they ever smoked 
cigarettes, and so if they said they never did, then they didn't pursue asking questions about when is the last time you smoked a cigarette, because it makes no sense.

Then if they did, then I think they asked them whether you were a current smoker or a former smoker, and then depending on that branch, how many cigarettes you smoke now versus how many cigarettes you smoked before and when you started and when you quit.

So it wasn't just a simple set of questions that were asked of every person, but there was this branching structure to it, which is more complicated to deal with, more complicated to model than a simple, how old are you, how many cigarettes do you smoke, what is your mother's age when you were born, those simple, you know, somewhere everybody gets a question. That created complications for us as well.
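
[Editorial illustration, not part of the testimony.] The branching question structure described above can be sketched as a small classifier that separates an item a respondent was never supposed to answer from an item that is truly missing. The item names (`ever_smoked`, `current_smoker`, and so on) and the exact skip rules are hypothetical stand-ins for the structure the witness recalls. They are not the actual NMES variable names or questionnaire logic.

```python
def classify_smoking_items(answers):
    """Label each item 'observed', 'missing', or 'not_applicable'.

    `answers` maps hypothetical item names to a value; an absent key or None
    means no answer was recorded. Skip rules are assumed for illustration.
    """
    ever = answers.get("ever_smoked")
    current = answers.get("current_smoker")

    def on_branch(want_current):
        # False: the branch was skipped; None: unknown because a gate is missing.
        if ever is False:
            return False
        if ever is None or current is None:
            return None
        return current == want_current

    asked = {
        "ever_smoked": True,              # asked of everyone
        "current_smoker": ever,           # only ever-smokers are asked
        "age_started": ever,
        "cigs_per_day_now": on_branch(True),      # current smokers only
        "cigs_per_day_before": on_branch(False),  # former smokers only
        "age_quit": on_branch(False),
    }

    status = {}
    for name, was_asked in asked.items():
        if was_asked is False:
            status[name] = "not_applicable"
        elif answers.get(name) is None:
            status[name] = "missing"  # includes items behind a missing gate
        else:
            status[name] = "observed"
    return status
```

Under these assumed rules, a never-smoker has one observed item and five structurally skipped ones, while a respondent whose first answer is missing has every item missing, because even the applicable branch is unknown. That second case is the kind of complication a single flat list of questions would not create.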

You testified there were 500 variables for 
each person?

A. Approximately.

Q. Do you know how many variables there were for that related to smoking, smoking status, habits?

A. Well, let's see. Approximately five.

Q. That is because how many questions were asked on the survey, approximately?

A. It doesn't have this one-to-one correspondence -- because the fact it has this nesting structure. For some people it asked one question and no more. For some it asked

One of the issues is how to code the information in such a way that it was most -- would be most reliable for future analyses. One of our objectives was to try to include in all the variables in NMES for which we were doing multiple imputation or using to do multiple imputation each of the variables that any of the plaintiffs used in any of the
cases, in the smoking cases, as well as the variables that the defense used so we would have a complete data set that could be used to support analyses of any of the types that were -- had been used in these cases.

Q. Well, did you have access to the actual data from NMES?

A. I personally did not, but I believe both Raghunathan and the Wecker team did.

Q. Where did you get that?

A. Where did they get it? I don't know. I think -- well, I can say -- well, I know a little bit.

Q. Tell me what you know.

A. Okay. At one point, Raghunathan had the NMES data set that was used by Harrison in Oklahoma. Harrison did an analysis for the plaintiffs in the Oklahoma lawsuit, and in fact, that was one of the reasons why we were delayed. He was using that data set, and Harrison had done certain things to that data set -- I am not saying necessarily they are wrong or right -- but they had done certain things to that data set that were not right for our purposes.

Q. When were --

A. And --

Q. Excuse me. When was that obtained, first obtained? Do you know?

A. Raghunathan must have obtained that at the time of the Oklahoma -- when the Oklahoma lawsuit was going on, when it was ongoing.

Q. Was Raghunathan involved as a defense statistician in any litigation prior to the Northwest laborers or ironworkers case?

A. I don't think he has ever been involved as a statistician except I have been using him, if that's what you mean.

Q. Yes.

A. I have been using him to do some analyses, primarily these propensity score analyses and now these multiple imputation work --

Q. So as I understand it then --

A. -- for --

Q. -- the Harrison data set was provided pursuant to discovery or whatever in the
Oklahoma Attorney General litigation? Correct?

A. I'm not sure, but presumably yes.

Q. All right. And Dr. Raghunathan had that data set in conjunction with the Oklahoma case? Correct?

A. Correct.

Q. So, in other words, it wasn't like you had to get it, the Harrison data set, and send it to him for the first time as the result of this case; correct?

A. Correct. Oh, absolutely correct.

Q. Do you recall when that data set, the year that that data set was provided to Raghunathan?

A. I do not.

Q. It would have been in 1997 or 1998? Correct?

A. I think that has got to be right.

Q. Something like that?

A. Yes.

Q. So, in other words, had the defendants asked you shortly thereafter to oversee the project of multiple imputation of NMES back when Dr. Raghunathan received the Harrison data set, you could have begun work then; correct?

A. Could have begun work at that time? Although it turns out that that data set was not the right one to use in any case.

Q. Was there another data set?

A. I think -- I think that the data set that is now being used came from Wecker Associates, and I think that is the real NMES. What I mean by the real NMES is is it is not as filtered through some other data analysts' hands.


lid Wecker get that? 


I have no idea.

Was it in conjunction with the tobacco 
litigation? 

I have no idea. 

When did you find out that Wecker had this 
data set? 

Well, this is a slight reconstruction, but 
I think it is accurate. When we started finding out that there were problems with the multiple imputation of NMES that Raghunathan was doing based on communications between the Wecker folks and us and inconsistencies that were arising. We started realizing that we weren't dealing with the same data set.

Q. And at that point in time, Wecker had the data set that let's call it the Wecker data set -- and Raghunathan had the Harrison data set of NMES?

A. Right. That's my understanding.

Q. And they were both doing some work on it, and they realize there was something different?

A. Yes.

Q. I am using some lay terms but.

A. That's fine.

Q. Okay.

A. What you are saying is basically right. That's when we decided that, hey, there is a real problem here.

Q. So assuming the Wecker data set was available as of day one, whatever date that is, and I don't mean to suggest when it was, you could have begun -- he could have begun work on this particular project as of that date?

A. Well, he -- he could have had I started to think about how to do it and the right way to do it.

Q. All right.

A. Yes.

Q. You don't know exactly when Wecker got his data set?

A. I do not.

Q. Do you understand -- who in Wecker & Associates have you been working with, particularly as it relates to phase two and phase three?

A. Primarily Gary Harvey.

Q. Spell it.

A. H A R V E Y.

Q. What does Mr. Harvey do? What is his background?

A. His background would be, I think, in applied -- he was in the Air Force, I think, and did applied mathematics, and he has quite a bit of knowledge of statistics as well, but I don't remember his degrees. In fact I don't even know if I ever have ever seen a CV for him.

Q. Well, other than he was in applied mathematics in the Air Force, then you don't know what his training was?

A. Well, he -- he did -- he did a lot of computer support work and statistical support work at Wecker Associates for quite a few years.

Q. Has he ever done a multiple imputation analysis before?

A. I don't think so.

Q. How about Dr. Raghunathan has?

A. Yes.

Q. Now have you completed your description of phase one?

A. Unless you want more detail.

Q. I may, but just your general description of phase one?

A. I mean -- yes.

Q. Can you tell me about phase two, please?

A. Okay. Phase two took the output from phase one, which had parameter values in it, and actually imputed individual expense amounts for the events for which there was missing expense amounts.

I'm not sure I got the tenses right there and the singular and plural.

So the first phase produced, after all of this iteration, is designed to produce after all of this iteration certain things which might be called parameter values that can be used to, as input to the second stage. The second stage takes those input parameters and for each event with missing expenditures attaches to that an imputed expense amount which is randomly drawn from a certain distribution.

Q. Is it within a range?

A. Pardon?

Q. Is it within a range?
A. Yes. It is --

Q. With boundaries?

A. Well, it is -- it is imputed under log normal specification. Log of the data are normal. And it is subject to editing constraints that the values being imputed cannot be larger or smaller than anything that is observed in the data.

Q. Okay. Who wrote the program for that application?

A. I designed it, and -- this is phase two -- and Gary Harvey supervised someone or maybe more than one person at Wecker Associates to actually write the code.
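
[Editorial illustration, not part of the testimony.] The phase two draw described above, a log-normal imputation subject to the constraint that no imputed value falls outside the observed range, can be sketched as follows. The transcript does not say how the constraint was enforced. This sketch assumes simple rejection sampling, and the function name `impute_expense`, the parameters `mu` and `sigma`, and the retry limit are all hypothetical.

```python
import math
import random


def impute_expense(observed_amounts, mu, sigma, rng, max_tries=10000):
    """Draw one expense amount whose logarithm is normal(mu, sigma).

    Draws outside [min(observed), max(observed)] are rejected and redrawn.
    Rejection sampling is an assumption; the testimony only states that the
    constraint exists, not how it was applied.
    """
    lo, hi = min(observed_amounts), max(observed_amounts)
    for _ in range(max_tries):
        draw = math.exp(rng.gauss(mu, sigma))  # log of the draw is normal
        if lo <= draw <= hi:
            return draw
    raise RuntimeError("no draw fell inside the observed range")
```

In this sketch `mu` and `sigma` stand in for the parameter values that the witness says phase one passes to phase two. A caller would create one `random.Random` generator and call the function once per event with a missing expense amount.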

Q. And when did you design the program for phase two?

MR. BIERSTEKER: I object to the form.

A. My guess is December.

Q. December '98?

A. Probably early December.

Q. 1998?

A. Yes. 1998, I believe.

Q. And you were first asked to do that, to work on this, multiple imputation of NMES -- refresh my recollection. Was it around November, I think, 1998 or before then?

MR. BIERSTEKER: Objection. Asked and answered.

A. I'm not sure exactly when it was, but it followed the realization that Joe Schafer could not do it.

Q. Well, I think you answered that. Did you design any program for phase one?

A. I designed part of it. I guess I could add that phase one was utilizing a substantial programming effort that had already been implemented by Raghunathan at the Institute for Survey Research at the University of Michigan. So it wasn't all new code. It was taking that code that existed and had been used in various projects to implement the phase one.

Q. Well, do you know how many statisticians or people doing applied mathematics has
Q. And what were they?

A. -- one of the first that comes to mind I was slightly involved in maybe 10 years ago that was done at the Federal Reserve system by an Art Kennickal, K E N N I C K A L, I believe. And it is the -- what is -- I think it is the Survey of Consumer Finances.

Q. Where is Art Kennickal?

A. He is at the Federal Reserve. I think he is still there. That was -- and that used the same kind of idea as is being done here, the Markov Chain Monte Carlo of the particular type that has been done here. There is another effort -- I will do this in the order that comes to mind.

Q. I don't need --

A. At Erasmus University in the Netherlands, a guy named Jaap Brand, medical database.

Q. Spell it, please.

A. J A A P, B R A N D, at Erasmus University in Rotterdam. I think that was a fairly major application.

Q. On medical database?

A. Medical database, yes. There were other people involved in that, just as there
were with Art Kennickal whose names I can 


remember. No? 





There was NHANES, a long project we did with the National Center of Health Statistics. I was involved at the beginning. Mainly I directed it. It also involved Joe Schafer and Ron Little; Trina Ezzati-Rice, who is at NCHS; Meena Khari, who is at NCHS; what is his name, Johnson, I can't remember his first name, Wayne Johnson, who is at NCHS.

Q. Okay. That is three.

A. Okay. The Fatal Accident Reporting System, FARS, which is for the Department of Transportation, National Highway Traffic Safety Administration, which was done primarily by Joe Schafer with my designing aspects of it. He also used a
Raghunathan is involved in that I am not involved in, but it is through ISR, Institute for Social Research, at the University of Michigan, to multiply impute NHIS, N H I S.

Q. Which is what?

A. National Health Interview Survey.

Q. The same one that is involved in these, that is being used in these cases.

A. I am not involved in that one at all.

Let's see if there are any others that come to mind. It is possible there are others I know about, and it is certainly possible that there are several others that I don't know about that are underway. That's close.

Q. Doctor, as I understand your testimony, the only multiple imputation project that you have not been involved with at all, even peripherally, is the Erasmus University one that you have named? Correct?

MR. BIERSTEKER: I object to the form.

A. In fact, I was a slight advisor to him.

Q. Okay.

A. But I didn't -- I wasn't involved in any of the day-to-day stuff.

Nor am I involved at all in the NHIS project that is ongoing now, nor am I involved in the continuation of the Bureau of Labor Statistics project that

is involved in. I forgot to 
mention that. I think it is ongoing. 


Q. But you were involved originally? Correct?

A. I was involved in the first phase, but that is because I originated the idea, and now there are sort of other students who can do the work, and it is spreading out. There was a desire at the beginning, I think, to make sure the old man was involved in some sense.

Q. Okay. Now as I understand it then, was the output from phase one provided in a rolling production to Wecker to do work on phase two, or did it have to be complete -- did phase one have to be completed before anything was sent to Wecker?

A. It basically had to be completed, but let me clarify that to make sure there is no confusion.

As I said earlier, when you are doing a multiple imputation, what you do is randomly draw these things to complete one data set, and then you go and do it again. So at the time, one data set was filled up by Raghunathan. Then that data set could be rolled over, in your words, to the Wecker guys while waiting for the second data set to be completed.

Q. How many data sets were generated by Raghunathan?

A. Well, we haven't completed it yet. So the attempt would be to generate between five and ten.

Q. How many have been generated?

A. Well, we are not necessarily happy, fully happy with any of the ones that have been generated so far, but perhaps two have been generated that we -- we may be happy with.

Q. Two out of five to ten?

MR. BIERSTEKER: Well, objection to the form of the question.

A. Yes, but we don't know whether they satisfy the concerns that we have to make sure that they -- that there are no more

Q. All right. And I am going to ask you about the concerns you had about bugs or the output that you were getting. First I want you to complete your description of, have you completed your general description of phase two --


A. Yes. 


Q. -- as to what work is ongoing? 
A. Yes. 

Q. Now how about phase three? 

A. Okay. 

Q. Is that started yet? 
A. Yes.

Q. And what is phase three?

A. Phase three is it takes the output of phase two and proceeds to allocate expense amounts that are associated with missing ICD 9 codes to the actual ICD 9 codes.

In the 18 ICD 9 codes that we are using and everybody is using, one of them is for missing, and I said there is no -- there is an event, and sometimes a dollar amount with an event, sometimes there is no dollar amount with the event, but NMES may categorize that event as having a missing ICD 9 code.

At the output of phase two, we use that 18th level of ICD 9 code as missing and have these events now that have expenses associated with them.


The third phase takes those 
events with missing ICD 9 codes but with 


now imputed expense amounts and allocates 


them to the 17 real ICD 9 codes. So at 
the end of phase three, we have one data
set that has no missing values for anybody
on anything, and it has for the expense 
amounts it basically has for each of the 
ICD 9 codes it has real total expense 
amounts. 

There was a program required to be written 


Who wrote it? 




I don't know who wrote it. I designed
and described it to Gary Harvey and
Wecker Associates, and we iterated on
making sure what was going on was right,
and then I think someone -- well, I don't
think -- someone at Wecker Associates
wrote it. I don't know who the person
was. Gary Harvey might have. He might
have assigned it to somebody.

Now have you reviewed, yourself reviewed 
the output from phase one from 
Dr. Raghunathan? 

I have reviewed pieces of output all along 
the way and in an attempt to detect 
problems.

And have you detected problems?

Yes. In the past we detected problems. 
That is why we are still attempting to do 
it correctly. 

What were the problems you detected as to 

the output from phase one? 

The kinds of -- the kinds of problems
would be because of the coding problems --
I think in fact this was maybe in some
report that I sent that was -- that is the
one that held us up, I think, in last
January, because the coding problems that
differed between the Wecker NMES, as you
referred to it, and the Raghunathan NMES,
as you referred to it, that we found we
were imputing people who had, for example --
this may not be exactly right, but --
people who were classified as nonsmokers
as having a certain number of cigarettes
smoked, and people who were nonsmokers as
having a, oh, a quitting date, things like
that, that don't make any sense, and that
is because the coding is different. So
the communication between the two data
sets got confused and things being imputed
that made no sense, that were just wrong.

Did you detect any other problems? 

There is the usual kind of modeling 
problems that occur, especially with small 
sample sizes and with a large number of 
variables. In order to do this correctly,
it is critical to try to include as many
predictor variables on the missing value
as possible to maintain consistency of the
data set.


But some of these expense 
categories are quite rare. Across the
whole data set, there may have been only
-- approximately, for example -- 25
events of that particular type, that ICD 9
code in ambulatory or something like


that. And of those 25, perhaps only 13 
had observed expense amounts, which 
creates a small data set with observed 
values to try to predict something, 
especially if you are trying to predict it 
from 500 variables, approximately 500 
variables.

And so the coding of that had to
be done very carefully to make sure that 
the kinds of values -- that the computer 
program wasn't going crazy in some sense 
trying to invert singular matrixes and 
trying to use overflows and underflows 


on these computational problems with 
data sets. 

MR. BIERSTEKER: Off the record. 

(Discussion off the record.)

BY MR. WITHEY:

Is there such a thing as a diagnostic --
what is it -- diagnostician for all of
these three phases?

Yes.

Describe that process.


Well, that is basically the kind of


process I am talking about, these sanity 
checks, are the values that are being 
imputed within the range of the data that 
are plausible? Are there errors that are 
occurring because you are trying to fit 
models that can't be fit? And so there is 
-- there are these diagnostic evaluations
at each phase to make sure that the values
that are being imputed look plausible and
that the models that are being fit are
being fit in realistic ways.

There are so many variables that 
each iteration -- that is why it takes so
long. The computer is spinning, grinding
away, doing probably billions of
multiplications every few minutes. And so
it is -- you have to check to make sure
there is not some switch that got switched
the wrong way.

Okay. So the diagnostic process is not
part of any particular phase, but it is
ongoing and has to be done on an ongoing
basis as to all of the output?

Absolutely.

Fair enough?

When you write computer programs or use 
computer programs, there are always bugs 
initially. You have to make sure that you 
have some faith in the answers at the 
end. So whenever you -- actually that is 
true when you are writing your own
programs or even when using canned
programs that are commercially available.
It is very easy to misuse them and get 
results that make no sense. 

And I would assume that someone who is a 

diagnostician would have to really know
the system you have helped create? Fair
enough?

Absolutely. You have to know the system.
You have to know the data. You have to be
willing to iterate and go back and forth.

For that reason, it is so time consuming?

Absolutely.


Particularly here, because this phase one,
each run takes two or three days or two
days -- it is not like pushing a button,
looking at the output, and saying, oh, now 
I know what I did wrong. Now you know 
what you did wrong, you have to fix it. 

It is another three days before you see 
whether you have fixed it. So the process 
is very, very cumbersome. 

And I assume that if we were to employ
some diagnostician or person who reviews
it, it would be equally time consuming? 
Fair enough? 

A. Hopefully at that point it will be rid of 
most of the bugs. But that is right. It 

would be -- it would be -- well, let me
complete it. Actually not because -- let
me go back.


If you want to run the software,
that is correct. If you want to examine
the, let's say, five multiply-imputed data
sets that we give you, it would not -- you
would have to be the diagnostician for
those five, but it wouldn't require your
rewriting the program and rerunning the

Q. But it would still be time

consuming? Fair enough? 

A. Yes. It would be time consuming. 

Q. And - - 

A. But much less so than writing and 

debugging a program. I mean an order of 
magnitude less. 

Q. Let me ask you this then. Did you -- so
as I understand it then, did you also
receive output as to phase two that you 
were the diagnostician for? 

I was as people at Wecker. Yes. 

How about phase three? The same? 

The same. 

Do you have an estimate of the total
amount of time it has taken for you,
Dr. Raghunathan, and Wecker & Associates
on this project?

I have no idea how much Wecker Associates
-- the time they have spent on it. I
mean I can make a wild guess based on
phone calls and --

Go ahead.

MR. BIERSTEKER: Objection.
Don't make a wild guess.

And Raghunathan, I would have to look at 
the records, although I haven't gotten any
records from him for a couple of months, 
so I don't know what the last two months 
have been. 

How about yourself? 

I would have to go back and look.

Give me your best estimate.

100 hours. Maybe more. 

Would you assume that Dr. Raghunathan has 
more time in this than you have? 

Yes.

Would you assume Wecker & Associates has
more time than you have?

Yes. I think -- I would -- I would think --
that might be wrong.

That's your guess estimate?

Yes.

That they have more time?

Yes. I think they have more time.

And my time might be off, off as well, I
mean the number that I gave.

And --


To help calibrate, I can tell you that the 
NCHS project went on for several years.
Okay. 

Computing is better now. We know more 

about it now.

Let me ask you this. On the other
projects that you have been working on and 
have completed, including the ones that 
you mentioned, have you ever had anybody 
hired or asked to come in after the 

project was completed to review it and to
especially peer review it, if you will,
or to look through it to see if there is
any bugs, see if there are any obvious
errors, try to determine if there was any
methodological problems, review the
output, kind of do some checks, sanity
checks as you call them?

Those kinds of things?

At the beginning of the question, I am not
sure I am answering it right.

Go ahead. Explain what you meant.

Because I -- in all of these major 
projects, there are -- we want there to be 
critics. 

Right. 

And for the NCHS project, for example, the 

statisticians at NCHS who were involved in
helping us were critics who did a massive
evaluation. That actually was the basis
of their National Center for Health
Statistics, their deciding to use multiple
imputation in NHANES when the old
hot-deck methods that they regarded on the
basis of our evaluations as unreliable
relative to the multiple imputations that
we created. There are several procedures.
Publications on that.

And these, the massive evaluations done by --
how long did that process take?

There is a -- I am hesitating,
because there is doing the evaluations to
see that it looks good, and then there is
writing it up. Writing it up takes
people's time when they decide to write.

From start to finish including writing it
up, how long did it take?

Well, if somebody is not working on it for 
a year, does that count? That's what I am 
saying. We did these evaluations. They 
decided to, this is the way to go, 
everything looks great. That was probably
about two years ago. And it still hasn't
been fully written up. I am saying people 
got busy doing other things, although they 
decided this was the right way to go. 

How long did the evaluation process take? 


MR. BIERSTEKER: I object to the
form.

I am trying to remember. It was spread
over several months for sure, because it
wasn't a full-time activity of anybody's.

Okay. Several months, meaning three or
four, something like that?

Possibly. But again that is because
people did it when they had time to do it,
I think.

And I don't -- I don't really know,
because there was evaluation -- I was not
involved in it. I tried to in fact kind 
of stay separate from it, because I wanted 
to be objective. 

Fair enough. Do you know if there was a
similar evaluation performed on any of the 
other multiple imputation analysis
projects that you have identified?

Yes. The DOT one. Department of 
Transportation one, the evaluations were 
done - - actually let me be clear about the 
evaluation. 

Part of evaluation is when you
are trying to debug programs and get it to
work right. That is very time consuming.
Once we are sort of happy with that, with
that phase, then you go into the sort of
evaluation, can it answer really
questions.

The first kind of evaluation is
does it pass all of these sanity checks
and get the bugs out. The next part is,
now let's pretend somebody really
had this multiple-imputed data set and was
using it to address the questions of
interest, like a user, NHANES, does it
stand up to that.

Okay. 

That process is more straightforward. It 
doesn't involve this debugging. You think 
you have really done it.

The people that did that evaluation, as
opposed to the debugging aspect of it, 
were those people that were familiar with 
the project? 


Yes.

All right. And thus they were aware of
what was going on, at least generally, at
the time the work was being done, and then
they were brought in to do the evaluation,
or were they part of the original team?

They were part of the original team,
although they weren't part of the -- what
might be called the hard computational

And have you or do you ever know of
an experience in which a multiple-
imputation analysis has been done and then
someone was brought in who was never 
involved with any aspect of the work until 
they were asked to evaluate it? 

In some sense, yes. This project at the 
Bureau of Labor Statistics was interesting 
that way, in that at the time the Bureau
of Labor Statistics team was primarily
economists. They were worried about this 
consumer expenditure survey and the amount 
of missing data in it on how people spend 
their money, their expenditures, and 
income. They were worried about income. 

And in that project, Raghunathan
and I were contacted, primarily
Raghunathan originally, to multiply impute
-- maybe it was me -- I don't remember
who originally -- strike that -- to
investigate the possibility of
multiply imputing the missing data in this
consumer expenditure survey. The way it
was divided up is we decided to create a
test case for that where we would
create a data set at the Bureau of Labor
Statistics that the staff there would then
there create missing values in, give to
us, meaning Raghunathan and me, to
multiply impute, and give back to them to
evaluate. So only they knew the right
answer, because only they knew how the
missing values were created.

So they had nothing to do with 
the actual imputation itself. They were 
economists at the Bureau of Labor 
Statistics who were interested in finding 
out how well it worked. So they did the 
evaluation of it to find out how it worked
relative to their methods and their
suggestions. So in that sense, they were
independent. Because we never -- we never
knew the right answer.

I am talking about someone who would
come in cold who has never even -- who was
certainly never involved in the generation
or design of the computer programs that
have to be checked and analyzed,
was not involved with the generation
of any of the data, wouldn't have to check
that, and was not involved with the
software programs that would be used, were
not involved with the debugging of any of
the problems that existed, but would have
to come in fresh, anew, and try to
evaluate this program. Have you ever had
that experience?

MR. BIERSTEKER: Objection. It

assumes facts not in evidence. 

The Bureau of Labor Statistics project was 
really like that. 

All right. 

Because the people evaluating it sat in on
a meeting, because they were involved
in the funding of the project, but they
had nothing to do with any of the pieces
of it.

Actually there is another one
that is even cleaner. I don't know how
the evaluations really went. It is the
Federal Reserve example.

The surveys.

Consumer finance?

Yes. Consumer finance. That has been 
around I think close to eight years. It 
has been out there in the real world. 

There was a paper given on it last summer 
at the annual meeting of the Statistical 
Society. I think it has been evaluated by
a whole bunch of people involved in it.

How long did that evaluation process take, 
if you know? 

I don't know the people, so I don't know 
how long it took. But basically they are 
using a data set, and they are seeing how 
well it works. Apparently they think it
works well. The evaluation doesn't have
to be that involved.

You could have fooled me.

And the Bureau of Labor Statistics, how
long did that evaluation process that you
have described take?

Oh, the final? Maybe a few weeks.

There is another example I can give
you.

Never mind. 

An example that addresses exactly your 
question. 

We have to get this deposition done at 
five o'clock -- 
Okay.

-- or shortly there afterwards.

Otherwise, Phillips and I won't be able to 
see our families tomorrow morning. 

I was going to say it was the Census 
Bureau application. 

Is the work still in phase three or phase
-- or which phase of the work is

Well, we have all -- there are programs
for all phases. We are in the process of
checking results, sanity checks, on
the results of all phases.

Were there any problems with the output
that was generated from phase two that you
had to deal with?

Phase two and phase three are much more
straightforward, because they don't
involve this iterative simulation, Markov 
Chain Monte Carlo stuff. Initially there 
were. At present I think they are fine. 
What were the initial problems? 

Oh, things like putting in something 
upside down, putting in the inverse of 
something rather than -- dividing by
something rather than multiplying by
something, the kind of coding problems 
that always occur when you are writing 
code. 

You have referred to in the past to
writing code or coding. How is -- or
coding problems. How was that term used
in connection with doing a multiple
imputation of NMES?

I actually am using coding in two
different ways.

Describe both.

The first I talked about, coding problems
that led to our problems before, was coding
responses on NMES. In other words, if you
are a smoker, are you a current smoker,
you are coded one, and never smoker, you
are coded zero. Where is the variable put
down, how is that coded.

You have described that? 

Right. The other use of coding, and I 
apologize for this ambiguity, is writing 
computer code, writing programs. These 
are programming problems. Computer code
has programming problems.


What is the quantity of the output that 
has been generated to date, if you can 
describe it in either electronic or hard 
copy terms? 

The quantity? 

Yes.

The sort of final product or just all
the amount all the way along?

The final product.

The final product will be a -- it is a
copies of NMES, each copy with all
values filled in, no missing data at
all.

Is the most -- for any -- and you did
this for each individual; correct?

Each individual have complete data.

How many megabytes of computer space would
that occupy, the five sets? 

Five times as much as one -- as NMES 
occupies. 

And how much does NMES occupy? 

Well, how many people is it? I don't know 
how many megabytes. I would have to
multiply that out. It depends on how the
individual variables are coded and stored. 
There are different ways of doing that. 

And I don't have any idea whether it is 
packed or not packed. 

What does packed mean? 

Well, if I have a number, right, are you a
smoker or not, that is a zero-one. Do I
have to take up 32 bits to store zero-one?
I don't. It is one bit.

For any one individual, how many
imputations -- what would be the maximum
number of imputations done?

Well, right now we are planning to have a
total of five data sets. That would be
five. We may end up doing ten.

Well, but you have described how


oftentimes there is missing data for a 
whole series, a number of variables -- 
Right. 

-- that are looked at. If there is 500 
variables, might there be someone that you 
have implied data for 100 of those 
variables, for instance?

A. They will be all -- a person will have --
let me start again. 

On each of these data sets, like 
the five data sets we will produce, each 
person will have fully observed data. 

Q. Yes. 500 or so variables? 

Exactly.

The question I have is did you keep
track of how many of those, that data, is
multiply imputed as opposed to that which
is data provided under the underlying
data?

Absolutely.

There are indicators for which values are
imputed and which values are real.

Do you have a recollection as you sit here
what the most number of values that were
imputed for any given individual?

No. I don't.

Approximately?

I don't.

Could there be as much as 100 to 200 
variables that were imputed?

The only way you could ever get that high
is if all the expense data, so if somebody 
had many, many events, medical events that 
were missing and each cell we had -- that 
-- that question is not -- is not quite
-- it is hard to answer.

It is not precise?

It is a good question, so don't -- don't
mistake. Because all you care about
for a particular type of expense and an
ICD 9 code, you need to know the number of
events and the total expenses in order to
do the modeling, and so the imputation
actually in the earlier phases is taking
place at the event level. But then it is
totaled to get a total number of expenses
for those events. So we may be imputing
hundreds of events, but it won't translate
into hundreds of missing values for that
person because they get totaled for that
type of event.

I understand that. There is a -- okay. 

So when are you going to have 
those done?

Well, are you going to help us do the
diagnoses? I am kidding.




Seriously, we are in the process now of we
have two of these five that we think are
pretty good, and they are undergoing
diagnoses now at the same time that we are
generating more in hopes that we have
gotten rid of all the bugs along the way.

So when are you going to have it done?
Mr. Biersteker may have asked you this
question previously.

Yes. He asked me.

I won't interfere with the dirty --

MR. BIERSTEKER: More than once.

Not nearly as politely as you do.

Thank you very much. I was trying to be


polite. But he is paying you, and I am 
not, so. 

(Laughter.) 

I think there is the hope that by the end 
of next week we will have five that we 
regard as acceptable. 

Did you have a conversation with 


G & M COURT REPORTERS, LTD. 
(617) 338-0030 



Mr. Biersteker that went something like 
this, or any lawyer for the tobacco 
industry, that said -- went something like 
this: When is my report due? 

It is due on -- the last report 
is due on May 26th, supplemental report. 

Question: I won't have the
multiple imputation analysis done by
then.

And then a response saying --
no. Just end the question.

Have you had any conversation
where you informed Mr. Biersteker that the
multiple imputation analysis would not be
done by May 26th or the date when your
last supplemental report was submitted?

I am sure we must have.

Okay. And when you began this project in
December, you were aware, were you not, or 
were you made aware, of the deadlines for 
the submission of reports? 

I was made aware that there were 
deadlines. Absolutely. I didn't -- I 
don't know whether I knew the exact dates
for those deadlines, and I am assuming I
was made aware that it was a -- trying to 
meet those deadlines was important, and I 
also was aware that it was a major project 
to undertake. 

And you informed him of that, I assume? 



That it was a major project?

I also informed him that I thought I
had it divided into these phases in such a
way that they could, that these phases
could be worked on in parallel in the
sense of phase one working with the
Raghunathan group and phases two and three
working with the Wecker group.

Do you understand that once this project
is completed that both Dr. Wecker and
potentially other defense experts that
have relied upon NMES data will be 
expected to supplement their reports to 
the extent to which they want to utilize 
the work that you have done? 

I basically understand that. 

Okay. Do you know that those include
Dr. Brian McCall? Have you heard of his
name?

A. Yes. 

Q. Okay. Dr. Wecker for sure? 

A. Yes.

Q. And do you know of any other experts that
are essentially awaiting the results of
this project so that they can review the
data and supplement their reports other
than those?

I certainly am eagerly awaiting.

You are?

A. Yes. Do I know anyone else besides?

Q. Yes.

A. I do not believe I know anyone.

Q. Okay. I think that is the three. What is
your understanding, Dr. Rubin, of the
degree of precision required of 
plaintiffs' damage estimates in this legal 
proceeding? 

MR. BIERSTEKER: I object to the
form of the question.

I don't believe I have an answer to that.
I don't think I have an understanding
per se.

My objective is to try to produce 
valid point estimates and valid estimates 
of uncertainty, whatever they happen to 
be, from the data that are available. 

Would you agree that the appropriateness
of estimation methodology to a given
situation depends upon the degree of
precision required of an estimate?

MR. BIERSTEKER: I am sorry. Can
I have that question read back? I am not
sure I understand it, or maybe I just
missed it.

Does the appropriateness of an estimation
methodology to a given situation depend
upon the degree of precision required of
an estimate?

I am not sure I follow. I could try to
paraphrase and -- 

Well, let me try to clarify it -- 
Okay. 

-- and see what you don't understand about 
it. 

In other words, you would agree
that in the real world and including in
public health or in law or in academia or 
in -- in peer-review publications there 
may be different degrees of precision 
required? 

For addressing different kinds of 



And some of them require
percent confidence interval, for
instance, and others require more probable
than not degree of precision? Fair
enough?

I wouldn't phrase it that way, because
that is not -- that's not quite the right
way to phrase it, but I understand what
you are driving at. I would agree with 
the general intention of your question, I 
think. 

And, therefore, would you agree that the 
appropriateness of an estimation method, a 
method of estimating a value, medical cost 
of smoking, for instance, might depend 




MR. BIERSTEKER: I object to the
form of the question. It assumes facts
not in evidence. 

Do I understand that a doctor may 
sometimes reach that conclusion? 

Yes.

I understand that sometimes doctors reach
those kinds of conclusions.

Okay. Would you also understand that when
doctors reach that conclusion they may
tell their patient things like, well, I'm
not 100 percent sure, but I think more
likely than not that your smoking
contributed to your lung cancer?

I understand that doctors may say that.

Okay. And you understand that often
doctors --

MR. WITHEY: Strike that. 

Have you examined any of the data sets 
upon which studies on smoking and health 
other than the Cancer Prevention Study II 
have? 

MR. WITHEY: Strike that. Let me
marked as Exhibit 1, is this the
supplemental report of you submitted 
around February of 1999, Doctor? 

It looks to be. 

(Document headed Northwest 
Laborers Litigation Supplemental 
Report of Professor Donald B. 
Rubin marked Exhibit No. 1 for 
identification.) 

MR. WITHEY: Mark this as 2. 

(Second Supplemental Report of 
Donald B. Rubin - Northwest 
Laborers marked Exhibit No. 2 


for identification.) 

BY MR. WITHEY:

What has been marked as 2 is your second
supplemental report submitted in May of
1999?

It looks to be. Yes. 

And what additional work did you do in 
this case to generate Exhibit 1 other than 
what you had previously done before? 

Okay. My -- my memory is what I did to 

generate the supplemental report is I
actually clarified some of the parts that
were written in the original report and
added some various analyses. I think they
are primarily propensity score analyses or
maybe entirely propensity score analyses
that were not there in the initial report.

I think also that, if I remember
correctly, the supplemental report also
addressed Dr. Harris. I don't believe he
was in the first report.

All right.

So I think there was some minor
adjustments to deal with that.

All right. And then Exhibit 2, what
additional work did you do between
February and May to generate the second
supplemental report, Exhibit 2?

Okay. The primary work that is reported 
are propensity score analyses comparing 
CPS II with national representative 
databases. 

On page 19 of Exhibit 1, you list three 

conditions?

Correct.


And those three conditions relate to the 
propensity scoring? Correct? 

Right. Propensity scoring analysis to see 
how far apart groups are, treatment 
control groups are. Yes. 

The question I have for you is have you
examined the data sets upon which other
studies of smoking and health besides
CPS II -- let me go back.

Have you examined the data sets
upon which other studies of smoking and
health besides CPS II were based to
determine if any of the three conditions
on page 19 of your February report
exist --

MR. BIERSTEKER: Objection.

-- were satisfied I should say?

MR. BIERSTEKER: I object to the 

form. 

Well, if you turn to page 21, I am doing 
these analyses for NMES and NHIS, N M E S 
and NHIS, so those are not CPS II, and 
they all appear to have, all meaning I
have got subgroups, males, females,
formers and currents. They all seem to 
have fairly substantial biases between 
smokers and never smokers that suggest 
that relying entirely on models with their 
linearity assumptions is dangerous. 

MR. WITHEY: Actually I am going
to strike as nonresponsive. No.
I need to probably state the
question more clearly.

The question is NHIS does not look at the
relationship of smoking and health outcomes, does
it? I mean it doesn't try to determine
risk; correct?

MR. BIERSTEKER: I am going to --
Objection to the form of the
question.

I am not sure what you --
MR. BIERSTEKER: Can I help? Do 

you want me to help a little bit? 

MR. WITHEY: Yes. 

MR. BIERSTEKER: A data set 

doesn't really look at something. It is 
the analyst that comes behind.
THE WITNESS: Yes.

MR. BIERSTEKER: But go ahead. 

MR. WITHEY: I said the data set 

which studies the relationship of smoking 
and health. Maybe that is why. 



MR. BIERSTEKER: That is why --

MR. WITHEY: That is why I tried
to ask it again. I don't think I asked it
the right way.

THE WITNESS: Yes. What is
ing me is what --

BY MR. WITHEY:

You understand there are studies of

You understand they are subject to
biostatistical analysis; correct?

Correct.

You understand they are based upon data
sets that are generated in those studies
or that are referred to?

Correct.

Now as to those -- and there are probably
hundreds or plenty of such studies on 
smoking and human health? Fair enough? 
Fair enough. 

All right. And as to -- well, how many of
those studies have you applied the three
conditions listed on page 19 of your
February supplemental report, if any?

I don't think you mean applied. You mean

ine none, because I don't think I
ever looked at those other data

I am hesitating slightly, because I
-- I may have through -- through a
student's work, but I don't know.

How many studies, epidemiological studies
of exposure to other potential risk
factors believed to cause disease have you
assessed the three conditions on page 19
of your February report, if any?


MR. BIERSTEKER: Objection to 


form. 


Epidemiological risk factors? I guess a
few.

Which? Which risk factors and which 
diseases? 

Well, I am thinking primarily now of work
I did on prenatal exposure to barbiturates
and hormones as to exposure in utero,
where they were exposed and unexposed, and
the analyses go back 15, 20 years ago
initially when we were doing them. So we
did these kinds of analyses.

I don't remember from that time
whether I examined exactly these three
criteria as listed here, but certainly
they were -- they were there, because the
work it refers to was done before that,
done in the '70s.




-- I am specifically referring to those 
three criteria in this question. 

Well, certainly, yes. Those were looked 
at for sure. 

If your three criteria were used as a 
screen to determine if the results of


regression analysis are accepted for 
publication, what fraction of the papers 
based on regression analysis in the last 
say 10 years would have been published, do 
you believe? 

I don't know. I really don't know. 

Do you think it is more than half?

I don't know, because it depends upon how
-- how badly biased they are. I
certainly know there are examples where
these kinds of propensity scoring
techniques would have prevented -- I don't
know about epidemiology necessarily -- but
have prevented the publication of
results that aren't particularly good.

Have you ever analyzed a data set in a
published epidemiological study and found
it met all three of these criteria with
respect to exposure at issue in that
study?

MR. BIERSTEKER: Objection to the 

form of the question. 

So have I ever seen an epidemiological 
study where I compared the background


G & M COURT REPORTERS, LTD.
(617) 338-0030 

http://legacy.library.ucsf.e®ffitlid/eDla|0^a£)GW|W#. industrydocuments.ucsf.edu/docs/hygl0001 


52299 4670



Can you think of any others other than
that one? 

Where I have seen where they are close, 
close enough to not be concerned? 

Well, that met all three of your criteria? 
I don't.

That is not the literature that I
regularly read and then try to poke at to
criticize.

So no, I don't. I don't offhand.


Are you familiar with -- do you know
Dr. Jim Robins, Harvard School of Public
Health?

Jamie is what we call him.

Sorry.

James or Jamie.

And Steven Mark? Do you know those 
people?

Steven Mark, I know the name, but I am
trying to -- I probably do know him, but I
don't know him particularly well. 

Are you familiar with the article that was


published in the journal Biometrics, 
Estimating Exposure Effects by Modeling 
the Expectation of Exposure Conditional on 
Confounders? 

I have not read it. No. But I -- I saw
it referred to.

By Harris?

Are you curious about what the article

BY MR. WITHEY:

going to satisfy your curiosity --

since you haven't read it yet, I take
have not read it. I have not seen

MR. WITHEY: Mark this. 

(Article entitled Estimating 
Exposure Effects by Modelling 
the Expectation of Exposure 
Conditional on Confounders 
marked Exhibit No. 3 for 
identification.) 

BY MR. WITHEY:

If you could read the abstract. I know I
am not going to ask you to read the whole 
thing. If you can't answer the questions 
without reading the whole thing, let me 
know, because we may have to move on, but 
was a type of propensity. 

MR. BIERSTEKER: Let him at least
read the abstract before you ask the

THE WITNESS: Let me know where
he is headed. Sure.

MR. WITHEY: That is exactly --
tell him to shut up.

MR. BIERSTEKER: I withdraw my
objection.

MR. BIERSTEKER: Let the record
reflect Mr. Withey had a smile on his

MR. WITHEY: I absolutely did.

THE WITNESS: All three of us
did. Four


BY MR. WITHEY: 


What I want to know is the following. One 
is the type of propensity scoring if any
used in the analysis, any disagreements
with the authors about what they did, and 
what happened to the association between 
smoking and the FEV-1 that was looked at. 

MR. WITHEY: While you are 

reading this, we can take a brief break. 



MR. BIERSTEKER: Read as much as

MR. WITHEY: Well, I don't --

MR. BIERSTEKER: I understand.

MR. WITHEY: I don't think he
d the whole thing.

MR. BIERSTEKER: I don't know
answer all of those questions
n the abstract.

(Recess taken at 4:49 p.m.)
(Recess ended at 4:42 p.m.)

BY MR. WITHEY:

Q. Tell me when you are ready, Doctor.

A. We can try. I certainly understand the
abstract, where it is going in generality.

Q. Let me just direct a question.

A. Sure.

Q. A type of propensity scoring method was
used --

Yes.

-- to adjust for confounding; correct?

It appears that way. Yes.

And what were they used for?

It was used to adjust for multiple
confounders in a situation where there are
more than two treatment conditions, and
that is what the generalization they say is
beyond what Paul Rosenbaum and I did.
That is one of the ways they say they
generalize it.

Do you have any disagreement with what the
authors did with the propensity scoring?

In order to get into that, I would have to
actually read the article far more
carefully than I have in the last minute.

Can you confirm from looking at table 1
and table 2 that the association between
smoking and forced expiratory volume in
one second, known as FEV-1, after these
methods were used, did the association
change?

I can't determine that. I am looking at
table 2 to try to see. Table 1 has just
listed the variables. And I do remember
that's what Harris said in his
supplemental -- second supplemental
report.

As I look at table 2, what it
seems to be is -- and maybe this is
incorrect -- but it seems to be where you
do no adjustment at all, and then what you
-- you do adjustment, use the propensity
scoring methods, bringing in one or more
variables into the propensity score
estimation, but my memory of what Harris
said and what you are saying is that these
t are saying you initially do this
model-based adjustment, and then you do
propensity, and the effect got
bigger. And I don't see that at all.

Maybe I am misreading it.

So doesn't the negative relationship 
between smoking and FEV-1 increase in 
magnitude the from minus .0580 to 
minus .1133? 

Yes. After I read the table, after my one
minute read of the table, which is not
much time, is all they are doing is
bringing more variables into the
propensity scoring estimation. They are
not comparing regression adjustment with
propensity score adjustment. So I don't
see why it is relevant. I mean as you
bring in more -- you would believe the
bottom result more, but that is just as
you would if a model were correct you
would believe the model more the more
confounders you bring in.

So what they are showing, I
think, is in a situation like this with
potential confounders you should adjust
for all the confounders you can and not
just for something simple, like chronic
cough or pack years of smoking or
something like that, but you are better
off adjusting for all the confounders.
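
The point made in this answer, that an estimate which ignores a confounder can differ from one adjusted through the propensity score, can be sketched with a small made-up example. The counts, the single binary confounder `x`, and every function name below are hypothetical and are not taken from Exhibit No. 3; the sketch only assumes that stratifying on an estimated propensity score is one way to make such an adjustment.

```python
from collections import defaultdict


def build_records():
    """Hypothetical data: (confounder x, exposed t, outcome y)."""
    records = []
    # x = 0: 20 exposed, 80 unexposed; x = 1: 80 exposed, 20 unexposed.
    # Outcome is 0.5 * x - 0.5 * t, so the true exposure effect is -0.5.
    for x, n_exposed, n_unexposed in [(0, 20, 80), (1, 80, 20)]:
        records += [(x, 1, 0.5 * x - 0.5)] * n_exposed
        records += [(x, 0, 0.5 * x)] * n_unexposed
    return records


def mean(values):
    return sum(values) / len(values)


def unadjusted_difference(records):
    """Exposed mean minus unexposed mean, ignoring the confounder."""
    exposed = [y for _, t, y in records if t == 1]
    unexposed = [y for _, t, y in records if t == 0]
    return mean(exposed) - mean(unexposed)


def propensity_scores(records):
    """Estimated probability of exposure for each value of x."""
    counts = defaultdict(lambda: [0, 0])  # x -> [number exposed, total]
    for x, t, _ in records:
        counts[x][0] += t
        counts[x][1] += 1
    return {x: exposed / total for x, (exposed, total) in counts.items()}


def stratified_difference(records):
    """Average the within-stratum differences, weighting by stratum size."""
    scores = propensity_scores(records)
    strata = defaultdict(list)
    for x, t, y in records:
        strata[scores[x]].append((t, y))
    estimate = 0.0
    for members in strata.values():
        exposed = [y for t, y in members if t == 1]
        unexposed = [y for t, y in members if t == 0]
        weight = len(members) / len(records)
        estimate += weight * (mean(exposed) - mean(unexposed))
    return estimate


if __name__ == "__main__":
    data = build_records()
    print("unadjusted:", unadjusted_difference(data))
    print("propensity-stratified:", stratified_difference(data))
```

In this invented data set the unadjusted difference understates the negative association, and stratifying on the propensity score recovers the larger built-in effect. Nothing here compares propensity adjustment with a regression model, which is the distinction the witness draws.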
Let me see that back. 

Okay. Maybe I misread it.

(Handing Exhibit No. 3 to 
Mr. Withey.)


Or maybe I misread Harris' comments. 

That's possible, too. 

Table 1 lists 22 potential confounders to 
the effect of current smoking on forced 
expiratory volume in one second; correct? 
That's what it says. Yes. 

All right. And these include such things
as body mass, chronic cough, former pipe
current asthma, et cetera?

That's what it says. Yes.

And then on table 2, they -- I want you to
you agree with me -- that there is
an analysis of what is known as a constant
term only and an analysis of a constant
with the 22 covariates in table 1 that is
listed. Did I read that correctly?

Correct. 

And I take it then that you don't -- 
without reading the article, you can't say 
what the significance of the application 
of these 22 covariates in table 1 in this 
table means as it relates to item number 
one, the constant term only; correct?

I can make a good guess. They are
referring to an equation which is for this
the probability, the propensity score. It
is basically saying that as you bring in
more and more terms into the propensity
score, as you are adjusting for more and
more confounding variables by the
propensity score, the answer gets --
changes, and it goes down to -- the answer
you believe most is the bottom line, which
lines that adjusts most of the
covariates.

It doesn't say anything about whether --
how that answer would compare to an answer
based on a model-based adjustment. I see
in this table no model-based adjustment at
all in the usual sense of model-based
adjustment.

Okay. 

Just as if you believe you like the bottom 
line you should bring, which I think I 
should in such a case, you should adjust 
for more and more confounding variables.



and using propensity scale is a good way 
of doing it. That is my read based on 60 
seconds. 

Dr. Harris in reviewing this article, 
which admittedly you haven't finished 
reading yet or started reading? 

That is right. I haven't started.

Methods based upon the propensity scoring
methods actually strengthen the
association between smoking and an FEV-1?

Strengthen relative to what?

MR. BIERSTEKER: Right,
between --

adjustment at all.

adjustment at all?

Okay.

Not relative to model-based adjustment,
which is what I thought the implication of
his comment. Was he was talking about my
criticism of relying on model-based
adjustments alone? I was saying you
should use both propensity and model-based
adjustments to get reliable numbers.

Q. I was asking about propensity..

MR. BIERSTEKER: Let him finish. 

MR. WITHEY: I am sorry. 

A. This article as far as I can tell does not
address that whatsoever. So I don't see
how that -- Harris' comment, now that I
read the article, is relevant to the point
that I thought he was making.

Q. I think the point made was sometimes when
you use a propensity scoring method you may
that it is associated -- that the
association between smoking and a given
te outcome may in fact be
strengthened.

A. Strengthened relative to not adjusting for
any confounding variables. It could.

My interest is in getting the
right answer.

Q. For all you know the application of that 
propensity scoring method to the data on 
relative risk in CPS II might also 
strengthen the association? 

A. It could. 

Q. Now are you familiar with either Kenneth
Rothman or Sander Greenland?

Yes, I am.

Those are pretty well world-renowned
experts on epidemiology or professors of
epidemiological. Dr. Rothman at the Boston
University School of Public Health;
correct?

Correct.

Do you know them personally?

Yes.

I assume you are familiar with their
Modern Epidemiology?

I know it exists. I have only glanced at
part of it.

Let me read you something I just read to
Dr. Mundt and see if you agree with both
Drs. Rothman and Greenland as well as
Dr. Mundt as follows. Let me read it to
you. This relates to the issue of 
attributable risk formulas or attributable 
fractions that you have talked about in 
your paper 
Okay. 

-- or in your report, I mean.

(Handing documents to the
witness.)

(Pause.) 

I have now read it. 

Do you agree with it? 

The "but infinity" is kind of dramatic. 

T nough. 

You have to have -- how many causes do you
have, each one itself is bounded
by 100 percent, how many would you have to
have to reach infinity? Dramatic,
but can certainly can exceed
100 percent? Fair enough?

You agree with that?

SgMMt M iBllv 


Have you read the report of 
Mundt in this case? 

I have not.

THE WITNESS: Off the record? 

MR. WITHEY: Yes.

(Discussion off the record.) 

MR. WITHEY: Back on the record. 

THE WITNESS: May I take a
two-minute break while you are looking
that up?

MR. WITHEY: If you need to.

THE WITNESS: I need two minutes.

MR. WITHEY: Then we are going to
go after five.

THE WITNESS: You will be done at
two minutes after five?


MR. WITHEY: Fine with me. 

(Recess taken at 4:54 p.m.) 
(Recess ended at 4:56 p.m.) 

MR. WITHEY: Back on the record. 

BY MR. WITHEY:

I want you to read page 15 of Dr. Mundt's
starting "Interpretation of
attributable fraction measures." Just
the first two paragraphs, if you



(Handing document to the 

witness.) 

(Pause.) 

(Witness examining document.) 

A. Okay. 

Q. Okay. Read aloud the last sentence from
the second paragraph, please.

I was puzzling about that as well.

"Thus, it is erroneous to 
interpret a single AFM as the proportion 
of disease which would be eliminated upon 
removal of a risk factor, which would 
require that the sum of all such fractions
not exceed 100 percent."

One of the things that makes it
hard for me to interpret that last
sentence, is that there is some words and
uses of words in the earlier part that I
was reading that I don't -- I'm not -- I
would have to read the beginning part to
know what he really meant by that. So it
is taken out of context.
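
Why attributable fractions for different causes can sum to more than 100 percent can be seen in a deliberately simple, hypothetical case. Suppose a disease occurs only when two factors are both present. Removing either factor alone would then prevent every case, so each factor's attributable fraction is 100 percent and the two together sum to 200 percent. The population counts and function names below are illustrative assumptions, not figures from any report in this case.

```python
def count_cases(population, removed=()):
    """Count cases when disease requires BOTH factors (a hypothetical rule).

    population: list of (has_a, has_b, number_of_people)
    removed: names of factors eliminated from everyone ("a" and/or "b")
    """
    total = 0
    for has_a, has_b, people in population:
        a_present = has_a and "a" not in removed
        b_present = has_b and "b" not in removed
        if a_present and b_present:
            total += people
    return total


def attributable_fraction(population, factor):
    """Share of current cases that would not occur if `factor` were removed."""
    baseline = count_cases(population)
    remaining = count_cases(population, removed=(factor,))
    return (baseline - remaining) / baseline


# Four equally sized hypothetical groups covering every combination of factors.
POPULATION = [
    (True, True, 100),
    (True, False, 100),
    (False, True, 100),
    (False, False, 100),
]

if __name__ == "__main__":
    fraction_a = attributable_fraction(POPULATION, "a")
    fraction_b = attributable_fraction(POPULATION, "b")
    print(fraction_a, fraction_b, fraction_a + fraction_b)
```

Each fraction answers its own counterfactual question, so nothing forces the fractions to partition the cases among the causes.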



Does that sentence appear to follow from
the preceding part of the paragraph, the
sentence you just read out loud?

I'm not sure. That's -- that's --

Fine.

I'm not quite sure,
was the

It could, if I had the context better.

but.

Does that paragraph accurately reflect 
what Rothman and Greenland are saying in 
the portion of the book that I just showed 
to you? 

The fact that he quotes that sentence from
Rothman and Greenland gives me the feeling
that he thinks he is writing consistently
with it, so that's why I think that I
get the context --

Right.

more complete to understand if it is
consistent.

At least it is not clear to you that that
sentence is in fact -- follows or
accurately reflects what Rothman and
Greenland were saying?

With those -- those two sentences, there
is an intermediate sentence that could be 
missing, or that intermediate sentence 
could be filled in by reading the rest of 
the document. 

Well, let me ask you this. Is it proper 
to interpret an attributable risk value as

It was included in one of the other cases
that you and I met on, so maybe it was
Ohio.

Look and see if it is in there. 

It is not included in this report. 

That is Exhibit 1. How about in
Exhibit 2? It might have been included?


MR. BIERSTEKER: Wait, wait. 



N o^p" 

| "'<r" i 


wa 


THE WITNESS: What is it?

An overview. On page 15.

MR. BIERSTEKER: It is cited. It
is not clear to me whether it was attached
or not, Michael.

MR. WITHEY: Okay.

THE WITNESS: Cited at the bottom
of page 15 of the first supplemental
report.

MR. WITHEY: Let me take a look. 

THE WITNESS: Sure. I will give 
you the whole thing, so it doesn't get 
separated. 

BY MR. WITHEY:


Page 15? 

Yes. The bottom of it. 

MR. WITHEY: Have we been 

provided with that, Peter? 

MR. BIERSTEKER: Yes. If you 

want another copy, I can send it to you. 

MR. WITHEY: No. That is okay. 

Okay.

BY MR. WITHEY:

I think it was in your second supplemental
you discussed the fact -- maybe it
was your first -- that CPS II
population was more homogenous than the
national population?

And by that I assume you mean the
difference between smokers and nonsmokers
were not as different in the CPS II as in
the general population?

No. I didn't -- I didn't mean that. Let 
me take a look. 

What I was saying is that if you
looked at just the general spread of the
distribution of variables in CPS II, they
were less spread out. So, for example,
the age distribution was more compressed.

The other -- the other -- what
other variables? Age, race, marital
status, education, body mass index, high
blood pressure and diabetes, were -- were
more compressed in CPS II than in these
national representative samples.

Would you believe that that
homogenous quality would apply to the
differences, if any, or the -- yes,
between smokers and nonsmokers in that
population?

Not necessarily.

So you don't have an opinion more
likely than not that the differences
between smokers and nonsmokers in the
CPS II would be less than the differences
between smokers and nonsmokers in the
general population?

MR. BIERSTEKER: I will object to 

the form of the question. 

Because CPS II is more compressed with 
respect to variability, then it is likely
that it -- it -- it would be more
compressed with respect to variability
between smokers and nonsmokers.

That is kind of what I thought. So now if
that's true -- okay. No. Actually that
answers the question. Let me see if there
is anything else.

I assume by more compressed, you
are referring to more homogenous?

Right.

Just to clarify, if you can't --
you have a data set where all the ages
are between 35 and 55, it is hard to get
smokers and nonsmokers more different than
55.

That is a good example.

Whereas if the data set varies between 20
and 80, then it is possible smokers and
nonsmokers differ by 20 to 80. Not very
likely that much, but you get the
picture.
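
The arithmetic behind this answer can be sketched briefly: when every value of a variable lies in a narrow range, no split of the data into two groups can produce a gap between group means larger than that range. The two age lists and the function names below are hypothetical, chosen only to mirror the 35-to-55 and 20-to-80 ranges mentioned in the testimony.

```python
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


def widest_mean_gap(ages):
    """Largest |mean(A) - mean(B)| over every split into two non-empty groups."""
    indices = range(len(ages))
    best = 0.0
    for size in range(1, len(ages)):
        for chosen in combinations(indices, size):
            group_a = [ages[i] for i in chosen]
            group_b = [ages[i] for i in indices if i not in chosen]
            best = max(best, abs(mean(group_a) - mean(group_b)))
    return best


# Hypothetical samples: one compressed age range, one spread out.
NARROW_AGES = [35, 40, 45, 50, 55]
WIDE_AGES = [20, 35, 50, 65, 80]

if __name__ == "__main__":
    for label, ages in (("narrow", NARROW_AGES), ("wide", WIDE_AGES)):
        spread = max(ages) - min(ages)
        print(label, "range:", spread, "widest gap in means:", widest_mean_gap(ages))
```

The search over splits is exhaustive, so it is only practical for tiny lists; it is meant to show the bound, not to analyze real survey data.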

And does the degree of compression of 
those variables, or the homogeneity, tend 
to support the internal validity of the
relative risk derived, everything else
being equal?

MR. BIERSTEKER: I object to the
form of the question.

If -- I am supposed to suppose now there
is more overlap in the smokers and
nonsmokers in CPS II than in NMES, for
example? Is that what you are saying?

Not

am talking about the internal
validity. More than the general
population. Let's put it that way. That
it is more -- that the differences between
smokers and nonsmokers is less in the
study than in the general
population?

aunMiuMb. 

Ifti^jP^it want to use NMES, I suppose, but. 

I am supposed to assume that that is
true?

Q. Yes. 

A. Right. I assume that is true. Now go 
on. I am sorry. 

Q. The question is, therefore, would that 
fact, everything else being considered,

render the relative risk derived from 
CPS II more reliable? 

MR. BIERSTEKER: Objection to the 

form. 

No. What it would mean is that if you
derive the relative risk from CPS II using
models with lots of confounders brought in,
that those models would be less sensitive
to the modeling assumptions than they
would be in a data set where there are
bigger differences between smokers and
nonsmokers.

Would it mean that there would be -- would
it mean -- would the lack -- excuse me --
would the relative homogeneity of the
CPS II population mean, again everything
else being equal, that the relative risks
are less likely to be confounded?

Not really. It is just that the -- that
the -- the ability to eliminate the 
confounding, it is easier to eliminate the 
confounding, and let me clarify. 

Everything else being equal -- you put 
everything else being equal in. So if I
think about that in a broad enough way,
then I think the answer to your question
is probably yes, because they are less --
there are less biases, and there are more
-- but it is only because we are looking
at it at the wrong population subset. It
is a little piece of the population where
things are simple -- simpler to do than
the right population.

Great. 

MR. WITHEY: Thank you. Doctor, 

nothing further. 

THE WITNESS: Sure. 

MR. WITHEY: Obviously, I think
I should state for the record -- it
doesn't need to be stated -- that we in no
way waived our right to object to and
move to exclude any testimony based on any
multiple imputation analysis that hasn't 
been completed yet. I don't know if I 
even had to say that, Peter. I am not 
trying to beat you up on a Friday night. 

MR. BIERSTEKER: I don't think 

you needed to say it.

DEPONENT'S ERRATA SHEET
AND SIGNATURE INSTRUCTIONS 

The original of the Errata Sheet 
has been delivered to Peter J. Biersteker, 


Esq. 


When the Errata Sheet has been 


completed by the deponent and signed, a
copy thereof should be delivered to each
party of record and the ORIGINAL delivered
to Mjpgirs & Anderson Court Reporters, to
whom the original deposition transcript
was delivered.


INSTRUCTIONS TO DEPONENT


After reading this volume of your
deposition, indicate any corrections or
changes to your testimony and the reasons
therefor on the Errata Sheet supplied to
you and sign it. DO NOT make marks or
notations on the transcript volume itself.


REPLACE THIS PAGE OF THE TRANSCRIPT WITH 
THE COMPLETED AND SIGNED ERRATA SHEET WHEN 
RECEIVED.

ERRATA SHEET

INSTRUCTIONS: After reading the
transcript of your deposition, note any
change or correction to your testimony and
the reason therefor on this sheet. DO NOT
make any marks or notations on the
transcript volume itself. Sign and date
this errata sheet (before a Notary Public,
if required). Refer to Page 113 of the
transcript for errata sheet distribution

I have read the foregoing
transcript of my testimony and except for
any corrections or changes noted above, I
hereby subscribe to the transcript as an
accurate record of the statements made by
me.

NORTHWEST LABORERS LITIGATION


EXHIBIT NO. A _ 

7 ^# Ocjy 

J. WILLIAMS 


SUPPLEMENTAL REPORT OF PROFESSOR DONALD B. RUBIN 


I am a professor of statistics at Harvard University. I served as Chairman of Harvard's 
Statistics Department for nine years, from 1985 to 1994. A copy of my most recently-prepared
curriculum vitae is attached to this report as Exhibit 1.

I have been asked to analyze the estimates prepared by plaintiffs' experts, Drs. Harris and
Mr. Roberts, of the excess health-care spending by the plaintiff union trust funds in the
Northwest Laborers litigation attributable to defendants' alleged misconduct.

In this Report, I incorporate all of my opinions from my original report dated November 6,
I report additional opinions based upon my review of the new reports filed by the
experts, Dr. Harris, Dement and Roberts, in late December, 1998. Thus, this Report
contains all of my principal opinions in this case.

In this Report, I express five broad opinions, each of which is explained more fully below:


\ 


1. Reliable and statistically valid estimates of the health-care expenditures, if
any, incurred by the plaintiff trusts as a result of defendants' alleged wrongful
conduct can be calculated.


2. The analyses of the plaintiffs' experts, Dr. Dement and Mr. Roberts, do not
even claim to estimate the excess medical expenditures of the trusts due to
the effect of the defendants' alleged misconduct at all, including, particularly,
the excess expenditures of the trusts due to the effect of that alleged
misconduct on the trusts' behavior.

3. It appears that the analyses of Drs. Dement and Roberts attempt to estimate 
the excess medical expenditures of the trusts due to the existence of smoking. 
But those estimates, confidence intervals, and any statistical significance 
claimed for those estimates provided by the plaintiffs’ experts are unreliable 
and statistically invalid. 

4. Although Dr. Harris' December 30, 1998 report does attempt to estimate the
effect of defendants' alleged misconduct generally on the medical
expenditures of the trusts, his analyses essentially have none of the
characteristics that they must have to generate statistically valid estimates of
the trusts' expenditures, if any, that were incurred because of the defendants'
alleged misconduct generally. Those estimates, confidence intervals, and any
statistical significance claimed for those damage estimates provided by
Dr. Harris consequently are unreliable and statistically invalid.

5. Dr. Harris does not even attempt to estimate the excess expenditures of the trusts due
to the effect of the alleged misconduct on the trusts' behavior.


1 . 




PO ADDRESS THE QUESTION 

**< s»f 

Introduction 


To address the question of what excess sums, if any, the trusts expended as a result of
defendants' alleged misconduct, one must compare the health-care costs that the trusts actually
incurred with what those costs would have been in counterfactual worlds without the alleged
misconduct. To start, one has to specify precisely what the alleged misconduct was and when it
occurred, as well as what would have occurred in the absence of that misconduct and when it would
have occurred. Because there is no direct data on the effect of defendants' alleged misconduct on
the trusts' excess costs, two models must be used: a behavioral model and a medical expenditure
model.

The behavioral model would have one or two components, depending upon whether the trust
funds can recover excess expenditures due to (1) the general effect of that alleged misconduct,
including effects on the trusts' behavior and on the behavior of the individual participants in the
trust, or (2) the effect of the defendants' alleged misconduct on the trusts' behavior.

The one-component behavioral model would estimate the effect of the alleged misconduct 
on the subsequent smoking behavior of individuals who were or would have been participants in 
the plaintiff trusts, regardless of whether that change in smoking behavior was due to changes in the 
trusts' behavior. 




The more complicated behavioral model would have two components. The first would
address "individual-initiated" changes in smoking behavior and the second would address
"trust-initiated" changes in smoking behavior. The first component would estimate the effect of the alleged
misconduct on the subsequent smoking behavior of individuals who were or would have been trust
participants, assuming no trust-initiated behavior was inhibited by the alleged misconduct. I call
this the "individual-initiated" component of the behavioral model because its focus is on the savings,
if any, to the trusts, due to possible changes in individuals' smoking behavior caused exclusively by
the effect on the individual of the lack of the alleged misconduct in a counterfactual world.

The second component of the behavioral model -- the "trust-initiated" component -- would
estimate the effect of the alleged misconduct on trust behavior in further reducing the smoking of
individual members through, for example, preventative programs. I refer to this as the
"trust-initiated" component of the behavioral model because it focuses on the savings, if any, to the trusts,
due to possible changes in individuals' smoking behavior caused exclusively by the effect of some
action taken by the trust as a result of the lack of the defendants' misconduct in a
counterfactual world.

The second model -- the medical expenditure model -- would estimate the effect on the trusts'
expenditures of those changes in smoking behavior estimated from the first model. The cost savings
that a trust could have effected in a counterfactual world without the alleged misconduct are the
reduced costs, if any, caused exclusively by trust-initiated changes in the smoking behavior of
individual trust members.

There are several other essential features of the proper data collection and modeling approach 
to address the questions posed by this lawsuit. For example: 


1. Both models must control for important background and other confounding
variables so that they compare behaviors and costs for like individuals, that
is, for matching individuals in the actual world (with defendants' alleged
misconduct) and in the counterfactual worlds (free of that alleged
misconduct).

2. Both models must take into account the passage of time, beginning when the
alleged misconduct occurred and continuing through each subsequent,
relevant year.

3. Both models must focus on the population of interest. Here, that is those
individuals who were or would have been recipients of health-care coverage
from one of the trusts in this litigation.

4. Both models must consider distinct types of smoking behavior that can lead
to different health-care expenditure outcomes.

5. For those aspects of either model that rely on assumptions rather than data,
the assumptions must be explicated and justified. It is critical, moreover, that
each such assumption be capable of being individually assessed and altered.

6. The possibility that smokers' other health-related behavior may be affected
by the alleged misconduct must also be considered.

7. The statistical techniques employed in both models and in the studies upon
which the models rely must be reliable and statistically valid. For instance,
missing data must be addressed in an appropriate manner; adjustments for
background and other confounding variables must be made after taking into
account differences in the distribution of those variables in the groups being
compared; and sound statistical methods must be employed to reflect the
reliability and uncertainty in the resulting estimates.

8. The expenditure model must consider distinct types of health-care
expenditures that can be differentially influenced by smoking behaviors.

9. Health-care costs that would have been incurred in counterfactual worlds
without the defendants' alleged misconduct must be accounted for when
estimating damages. For instance, any increased health-care costs due to
smoking that occurred before any alleged misconduct cannot be attributed to
that alleged misconduct because those costs also would be incurred in
counterfactual worlds without the alleged misconduct.


10. To the extent that the law limits the trusts' recovery to excess health-care 
costs incurred because of the effect of the defendants’ alleged misconduct on 


I -4- 

j 

j 

! 
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St"' 

F 


the trusts' behavior, then savings in health-care costs that would have 
occurred in a counterfactual world without the alleged misconduct and 
without any modified behavior on the part of the trusts ( i.e ., individual- 
initiated changes in smoking behavior) must also be excluded. For instance, 
cost savings from individuals whose smoking would have decreased without 
the defendants' alleged misconduct, regardless of whether or not the trusts 
modified their behavior, must be excluded. 

Finally, expenses incurred in instituting preventative programs would have 
to be deducted from any potential savings due to reduced smoking prevalence 
arising from such programs. 


B. Behavioral Model 


1. The one-component behavioral model

Consider first a model of the effect of defendants' alleged misconduct on the smoking
behavior in the pertinent population. This model was described in Section I.B.1 of my original
report, but for completeness is also described here.

Suppose it were alleged that the defendants failed to disseminate information about the health
effects of cigarette smoking in 1965. I would first want to estimate the effect of that alleged failure
on the smoking behavior of the individuals who were or would have been the recipients of the
trust-funded health care. For each distinct type of smoking behavior that is considered relevant to
health-care costs, I would estimate its prevalence in the actual world and in the counterfactual world.

This estimation would be based on actual world data and on expert opinion concerning the
effect on smoking cessation and smoking initiation of the availability of additional information about
smoking's health risks. Such a model of smoking behavior would clearly have to consider
background characteristics of the individuals who were or would have been recipients of the trust-
funded health care; control for those background characteristics and other confounding factors that
alter the effect of the alleged misconduct on smoking behavior; and consider how changes in
smoking behavior in the pertinent population would vary from 1965 through the end of the damage
period.

Even when the objective of an analysis is solely to estimate aggregate quantities (i.e., not
within subgroups such as defined by year of birth and race), one must generally control for
background characteristics (such as year of birth and race) and other confounding factors (such as
nonsmoking health-related behaviors) in order for the resulting estimates and inferences to be valid
and reliable.


The task of creating a behavioral model is not as daunting as it might appear: simple
modifications of actual world prevalences of different types of smoking behaviors within specific
subgroups of people yield estimates under explicit assumptions.
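
The kind of "simple modification" described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the subgroup labels, the actual-world prevalences, and the assumed relative reductions are all hypothetical values, and the multiplicative form of the adjustment is one possible choice, not a method specified in this report.

```python
def counterfactual_prevalence(actual, assumed_reduction):
    """Apply an explicit, subgroup-specific assumption to actual prevalences.

    actual: subgroup -> prevalence of a smoking behavior in the actual world
    assumed_reduction: subgroup -> assumed relative reduction (0 to 1) in that
        prevalence in a world without the alleged misconduct
    """
    result = {}
    for subgroup, prevalence in actual.items():
        reduction = assumed_reduction[subgroup]
        if not 0.0 <= reduction <= 1.0:
            raise ValueError(f"reduction for {subgroup!r} must be in [0, 1]")
        result[subgroup] = prevalence * (1.0 - reduction)
    return result


# Hypothetical inputs: none of these numbers come from any data set.
actual = {"born 1930-39": 0.40, "born 1940-49": 0.35}
assumption = {"born 1930-39": 0.10, "born 1940-49": 0.20}

counterfactual = counterfactual_prevalence(actual, assumption)
for subgroup in actual:
    print(subgroup, actual[subgroup], round(counterfactual[subgroup], 4))
```

Because each assumed reduction is stored separately by subgroup, any single assumption can be inspected or changed and the estimates recomputed.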

2. The two-component behavioral model

Alternatively, a two-component behavioral model would permit the plaintiff trusts to
separate the excess health-care costs, if any, that they incurred as a result of the effect of defendants'
misconduct on the trusts' behavior. The second component of this model was described in
Section I.B.2 of my original report, but is described in more detail here after describing the first
component.

a. Individual-initiated changes in smoking behavior

Consider first the "individual-initiated" component of the model of the effect of defendants'
alleged misconduct on smoking behavior in the pertinent population. Suppose, again, that it were
alleged that the defendants failed to disseminate information about the health effects of cigarette
smoking in 1965. I would first want to estimate the effect of that alleged failure on the smoking
behavior of the individuals who were or would have been the recipients of trust-funded health care,
assuming no changes in the trusts' behavior in the counterfactual world. For each distinct type of
smoking behavior that is considered relevant to health-care costs, I would "estimate" its prevalence
in the actual world and the counterfactual world.

This estimation would be similar to that described above for the one-component model. It
would be based on data and expert opinion concerning the effect on smoking cessation and smoking
initiation of the availability of additional information about smoking's health risks; it would consider
background characteristics of the participants in the trusts; it would adjust for those background
characteristics and other confounding factors that alter the effect of the alleged misconduct on
smoking behavior; and it would consider how changes in smoking behavior in the pertinent
population would vary from 1965 through the end of the relevant period. This is necessary to obtain
estimates and inferences that are valid and reliable, even when the objective of an analysis is solely
to estimate aggregate quantities (i.e., not within subgroups such as defined by year of birth and race).

b. Trust-initiated changes in smoking behavior

Now consider the second component of the behavioral model, the "trust-initiated"
component, which concerns the behavior of the trusts due to alleged misconduct. The model must
supplement the previous analysis with a posited potential reduction in the prevalence of smoking,
also disaggregated by background characteristics and other confounding factors describing the trusts'
members, that the trusts would have caused in the absence of the alleged misconduct. Suppose again
that it were alleged that the defendants failed to disseminate information about the health effects of
cigarette smoking in 1965. I would want to estimate the effect of that alleged misconduct on the
behavior of each trust, and then estimate the effect of any modified behavior by the trusts due to lack
of alleged misconduct on the smoking prevalence of the individuals who were, or would have been,
the recipients of trust-funded health care. For each distinct type of smoking behavior that is
considered relevant to health-care costs, I would again "estimate" its prevalence in the counterfactual
world for members of each trust. The analysis would, in general, require a separate longitudinal
model for each trust because any potential trust-initiated changes in smoking behavior due to lack
of misconduct could have varied across trusts and in time.

C. Health-Care Expenditures Model

As described in Section I.C of my original September 11, 1998 report, once I had modeled
the effect of the alleged misconduct on smoking behavior over time, I would then model the effect
of smoking behavior on the health-care expenditures of the trusts' members.


This model of the effect of the changed smoking behavior on health-care expenditures of the
trusts would have to consider characteristics of the individuals who were or would have
been recipients of trust-funded health care. These characteristics would include year of birth,
[...], income level, education, baseline mental and physical health, and other confounding
factors, such as health-related behaviors, that may be important predictors of how smoking behavior
affects the medical costs of the trusts' members. As with the behavioral model, ideally each such
characteristic would be measured on each individual in the data set used to estimate the health-care
expenditures model before the moment in time when the alleged misconduct had an effect on that
characteristic for that individual.

The health-care expenditure model also would have to take into account the passage of time 
because costs accumulate in time and smoking behavior can change in time. 

Such a model may have some general similarities to the plaintiffs' experts' medical
expenditure models presented in litigation brought by the various states against these same
defendants, but without their errors, with implicit assumptions made explicit, and with proper
consideration of the passage of time.

D. Illustrative Examples 

The essential characteristics of the models necessary to address the trusts' expenditures
incurred due to the defendants' alleged wrongful conduct can be illustrated by examples. These
examples are meant to illustrate specific issues; they are not meant to suggest that the models need
to operate at this level of detail. The examples were presented in Sections I.D.1 and I.D.2 of my
original report and are repeated here with some clarifications. Also, the discussion in Section I.D.3
of my original report is expanded here.


1. Example One

Four of the essential features of the modeling required here are illustrated by a basic example.
First, it illustrates that, in order to estimate separately the effect of defendants' alleged
misconduct on the trusts' behavior, one must consider the actual world with the alleged misconduct
and at least two counterfactual worlds without the alleged misconduct, one without changed behavior
on the part of the trust and another with possibly changed behavior on the part of the trust. In
contrast, if one only needs to estimate the effect of the defendants' alleged misconduct in the
aggregate, then one must consider the actual world with the alleged misconduct and only one
counterfactual world. Second, it shows the need to consider background and other confounding
characteristics. Third, it illustrates the need to consider the passage of time. Finally, the example
illustrates that there is modeling uncertainty about what would have happened in a world without
the alleged misconduct, and so assumptions and bases for assertions about counterfactual worlds
without the alleged misconduct must be explicated.

In this example, we consider five worlds, one the actual world as it exists and four
counterfactual worlds without the defendants' alleged misconduct starting in 1965. Two of the
counterfactual worlds have no behavioral modification on the part of the trusts, and in the other two,
the trusts initiated smoking cessation programs. We consider the same individual from one of the
trusts in all of the worlds, which obviates the need to control statistically for background
characteristics and other confounding factors. We then see how this individual's health-care costs
vary in time in these five worlds, which leads to the calculation of this individual's trust's "excess"
health-care costs due to the alleged misconduct.

Actual World: Defendants Withhold Information on Health Risks

    1935  Born
    1955  Starts smoking
    1965  Defendants withhold information on health risks - continues smoking
    1970  Still smoking
    1985  Quits smoking
    1990  Quit for 5 years

Counterfactual Worlds: Defendants Disseminate Information on Health Risks

Trust's Behavior Unchanged

  Counterfactual World 1

    1935  Born
    1955  Starts smoking
    1965  Defendants disseminate information on health risks - quits smoking
    1970  Quit for 5 years
    1985  Quit for 20 years
    1990  Quit for 25 years

  Counterfactual World 2

    1935  Born
    1955  Starts smoking
    1965  Defendants disseminate information on health risks - continues smoking
    1970  Still smoking
    1985  Quits smoking
    1990  Quit for 5 years

Trust Initiates Preventative Program in 1970

  Counterfactual World 3

    1935  Born
    1955  Starts smoking
    1965  Defendants disseminate information on health risks - quits smoking
    1970  Trust initiates smoking prevention program - quit for 5 years
    1985  Quit for 20 years
    1990  Quit for 25 years

  Counterfactual World 4

    1935  Born
    1955  Starts smoking
    1965  Defendants disseminate information on health risks - continues smoking
    1970  Trust initiates smoking prevention program - quits smoking
    1985  Quit for 15 years
    1990  Quit for 20 years


First, consider counterfactual world 1 where there are individual-initiated changes in smoking
behavior effected by the lack of the alleged misconduct. There, the increase in the individual's
health-care costs due to defendants' alleged 1965 misconduct is different in 1970 than it is in 1990.
Specifically:

    In 1970, the measure of damages is a comparison between a
    thirty-five-year-old who has smoked continuously for 15 years in the
    actual world versus that same thirty-five-year-old who would have
    smoked for ten years but then became abstinent for five years in the
    counterfactual world.

    In 1990, the measure of damages for this same person is a comparison
    between a fifty-five-year-old who smoked for 30 years and has been
    abstinent for five years in the actual world versus that same
    fifty-five-year-old who would have smoked for only ten years and
    quit a quarter century ago in the counterfactual world.


The measure of any increase in the health-care costs of the individual due to the alleged
misconduct by defendants toward the individual plainly will be different in different years,
requiring a longitudinal analysis commencing with the date of the alleged misconduct, or its first
effect, and continuing through each subsequent, relevant year.
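
The year-by-year comparison can be traced mechanically. The following Python sketch is only an illustration: it encodes the birth, starting, and quitting years of the individual in Example One and computes, for any year, the individual's age, years smoked, and years abstinent in each world. The function and variable names are illustrative choices and not part of any model described in this report.

```python
BIRTH_YEAR = 1935
START_YEAR = 1955

# Year in which the individual quits smoking in each world of Example One.
QUIT_YEAR = {
    "actual": 1985,
    "counterfactual 1": 1965,
    "counterfactual 2": 1985,
    "counterfactual 3": 1965,
    "counterfactual 4": 1970,
}


def smoking_history(world, year):
    """Return (age, years smoked, years abstinent) in `world` as of `year`."""
    quit_year = QUIT_YEAR[world]
    years_smoked = max(0, min(year, quit_year) - START_YEAR)
    years_abstinent = max(0, year - quit_year)
    return year - BIRTH_YEAR, years_smoked, years_abstinent


for year in (1970, 1990):
    for world in ("actual", "counterfactual 1"):
        print(year, world, smoking_history(world, year))
```

The two histories being compared differ between 1970 and 1990, which is why a single-year comparison cannot stand in for the whole damages period.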

Also, notice that if this individual's smoking behavior were unaffected by the alleged
misconduct, as in counterfactual world 2, and, therefore, the person would have the same smoking
behavior history in the counterfactual world as in the actual world, there would be no effect of the
alleged misconduct on health-care costs for the individual.

Now consider the excess health-care costs to the trust caused by the effect of the alleged
misconduct on the trusts. Relative to the actual world, in both counterfactual world 1 and
counterfactual world 2, there can be no excess health-care costs incurred by the trust due to the effect
of the defendants' alleged misconduct on the trusts: the trust's behavior was unaffected by the
alleged misconduct. That is, although there are individual-initiated changes in behavior in
counterfactual world 1 (but not in counterfactual world 2), there are no trust-initiated changes in
either counterfactual world 1 or counterfactual world 2.


Now consider counterfactual worlds 1 and 3; the individual's behavior is the same in both
counterfactual worlds, unaffected by the trust's modified behavior due to the alleged misconduct.
In this scenario, once again, there are no excess costs to the trust due to the effect of the alleged
misconduct on the trusts' behavior because the individual's behavior is unaffected by the trust's
behavior; the changes in behavior from the actual world are solely individual-initiated.

Finally, consider counterfactual worlds 2 and 4. In this comparison, the increase in the trust's
health-care costs for this individual due to the effect of the alleged misconduct on the trust
consists of the extra costs from 1970, when the trust initiated its preventative smoking program and
affected the individual's smoking behavior, through the end of the relevant period. This amount is in
contrast to the potential increase in medical costs for the individual caused by the alleged
misconduct, which would begin to accrue in 1965 if that individual's smoking behavior changed in
1965, as in counterfactual world [...].

2. Example Two

In this example, there is the actual world, as it exists with the defendants' alleged
misconduct, and two counterfactual worlds without the defendants' alleged misconduct. In
counterfactual world 1, without the alleged misconduct, there were no behavioral changes for either
the individual or the trust. In counterfactual world 2, without the alleged misconduct, in contrast,
in 1975 a trust-initiated smoking prevention program caused the individual to stop smoking, but the
cessation of smoking by the individual was accompanied by overeating,1 which had its own adverse
health effects.



Actual World with Alleged Misconduct Starting in 1965

    Born; has family history of heart disease
    Starts smoking heavily; normal diet
    Defendants withhold information on health risks - continues smoking
    Still smoking heavily; slightly overweight
    Contracts lung cancer and dies
    No longer alive

Counterfactual Worlds: Defendants Disseminate Information on Health Risks

  Counterfactual World 1: No Change in Trust Behavior

    Born; has family history of heart disease
    Starts smoking heavily; normal diet
    Defendants disseminate information on health risks - continues smoking
    Still smoking heavily; slightly overweight
    No change
    Contracts lung cancer and dies
    No longer alive

  Counterfactual World 2: Trust Initiates Smoking Prevention Program

    Born; has family history of heart disease
    Starts smoking heavily; normal diet
    Defendants disseminate information on health risks - still smoking
    Trust starts smoking prevention program - quits smoking; gains weight
    Becomes severely overweight
    No change
    Has heart attack and bypass surgery
    Has second heart attack
    Still receiving health benefits from trust
    Dies


This example illustrates additional essential characteristics of the proper data collection and
modeling approach beyond the need: (a) to consider the behavioral modification of trusts and
individuals in counterfactual worlds, (b) to consider background and other confounding
characteristics, (c) to perform the analysis over time, and (d) to explicate assumptions. First, it
illustrates that one must consider the possibility that smokers' other health-related behaviors (e.g.,
overeating or alcohol consumption) might be different in a counterfactual world without the alleged
misconduct. Second, the example illustrates that trust expenditures can be higher in a world without
the alleged misconduct than in a world with the alleged misconduct.

1 A not unlikely possibility. See American J. of Epidemiology 1998; 148:821-830, 831-832,
suggesting that women who permanently quit smoking gained, on average, 19.2 pounds over five
years. The corresponding weight gain for men was 16.7 pounds.

In particular, in the actual world in this example, the trust bears the medical expense of this
individual's lung cancer and related death in 1994, whereas in counterfactual world 2, the trust bears
the medical expense of the individual's first heart attack and by-pass surgery in 1995, as well as the
second heart attack in [...], and all medical costs until the individual dies in 2005. Clearly, the costs
incurred by the trust due to its own behavior in the counterfactual world will be substantially larger
in that counterfactual world than either in the actual world or in counterfactual world 1 without
trust-initiated effects on the individual because of the extra expenditures needed to treat the heart
attacks, bypass surgery and later-life medical expenditures.

The health-care expenditure model also would have to take into account the passage of time
because costs accumulate in time. Instead of comparing total costs through the end of the damages
period, however, one could have compared costs up to some event in time, such as earliest death in
any world. Thus, for instance, in my example two, where the earliest death of this hypothetical
individual occurs in 1994, his health-care costs stream through 1994 would be less in counterfactual
world 2 without the defendants' alleged misconduct and with trust-initiated smoking prevention
programs than in the actual world with defendants' alleged misconduct (or in counterfactual world 1
without the alleged misconduct or changes in the trust's behavior) where the trust bears the cost of
treating the individual's lung cancer.
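
The dependence of the comparison on the chosen end date can be shown with a small Python sketch. It is an illustration only: the annual dollar amounts are invented, and only the event years 1994, 1995, and 2005 are taken from Example Two. It sums each world's cost stream through a cutoff year and shows that the sign of the difference changes with the cutoff.

```python
def cumulative_cost(annual_costs, through_year):
    """Sum the costs incurred in every year up to and including `through_year`."""
    return sum(cost for year, cost in annual_costs.items() if year <= through_year)


# Hypothetical annual costs to the trust (dollar figures are made up).
actual_world = {1993: 5_000, 1994: 60_000}           # lung cancer and death in 1994
counterfactual_world_2 = {1993: 5_000, 1994: 5_000,  # no lung cancer
                          1995: 80_000}              # heart attack and bypass surgery
for year in range(1996, 2006):                       # continuing care until death in 2005
    counterfactual_world_2[year] = 10_000

for cutoff in (1994, 2005):
    difference = (cumulative_cost(actual_world, cutoff)
                  - cumulative_cost(counterfactual_world_2, cutoff))
    print(cutoff, difference)
```

With these made-up figures the actual world is costlier through 1994 and the counterfactual world is costlier through 2005, mirroring the qualitative point of the example.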

Because these alternative worlds are counterfactual, there is no direct evidence to estimate
how likely each world would have been in the absence of the alleged misconduct. Instead, evidence
and data observed in the actual world must be coupled with assumptions to estimate the likelihood
of alternative counterfactual worlds.

Finally, from a statistical perspective, such an analysis in each counterfactual world need not
be done for each individual who was or would have been a participant in the plaintiff trusts. Rather,
such an analysis is needed for each subgroup defined by the pertinent smoking behaviors,
background characteristics, and confounding factors discussed previously. The subgroup-specific
estimates are then aggregated across all the subgroups to create an estimate of each trust's costs due to
the alleged misconduct.
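
The aggregation step can be sketched as follows. This Python fragment is a minimal illustration, not the analysis itself: the `Subgroup` fields, the labels, and every number are hypothetical, and it assumes that each subgroup contributes its membership count times the difference between its estimated mean cost per member in the actual and counterfactual worlds.

```python
from dataclasses import dataclass


@dataclass
class Subgroup:
    label: str
    members: int                # trust participants in this subgroup
    actual_cost: float          # estimated mean cost per member, actual world
    counterfactual_cost: float  # estimated mean cost per member, counterfactual world


def aggregate_excess_cost(subgroups):
    """Sum subgroup-level excess costs (actual minus counterfactual) over a trust."""
    return sum(s.members * (s.actual_cost - s.counterfactual_cost) for s in subgroups)


# Hypothetical subgroups; the second has higher costs in the counterfactual world,
# so its contribution to the total is negative.
trust = [
    Subgroup("born 1930-39, long-term smokers", 100, 1200.0, 1000.0),
    Subgroup("born 1940-49, recent quitters", 50, 900.0, 950.0),
]
print(aggregate_excess_cost(trust))
```

Keeping the subgroup contributions separate until the final sum makes it possible to see which subgroups drive the total and to revisit the assumptions behind any one of them.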

3. Comments On The Task

Drawing causal inferences from observational data in this way requires care. Ideally the
effort would involve the input of a team of experts, including ones having knowledge about medical
expenditure models and behavioral models. Nevertheless, there is a straightforward scientifically
and statistically valid framework for drawing causal inferences about any medical expenditures due
to defendants' alleged misconduct.2


2 An overview of this framework is stated in Rubin (1998) "What does it mean to estimate the
causal effects of 'smoking'?" Invited paper, August 10, 1998, American Statistical Association
Section on Epidemiology. More aspects of this approach were presented at an invited Plenary Talk
"Estimating the Causal Effects of Smoking" for the Centers for Disease Control and Prevention,
Atlanta, Georgia, January 29, 1999.

To be scientific, the underlying assumptions must be explicated in detailed and disaggregated
forms. This way, the assumptions can be assessed individually for plausibility and can be altered
individually to allow observation of the consequences on answers. When all the assumptions are
bundled together, the resultant analysis may be little more than a subjective assumption of the [...]
Statistically, each instance of defendants' alleged misconduct (specified by its character and
its timing) can be viewed as defining a level of a factor in a hypothetical factorial experiment. For
example, one factor could correspond to the defendants' alleged failure to disseminate adequately
information regarding the health risks of smoking, and levels of this factor could correspond to the
information and dates of dissemination. A second factor could be the alleged failure
of defendants to market "safer" cigarettes. The combinations of different levels of these factors
define the defendants' conduct in a counterfactual world without specific acts of alleged
misconduct. Issues of additive versus synergistic effects of the alleged acts of misconduct would
be addressed by posited interactions in the hypothetical experiment. For instance, in the absence of
interactions, the costs due to specific alleged acts of misconduct would simply be the aggregate
(sum) of the costs across the specific, individual alleged acts of misconduct. This framework allows
the explication and disentangling of assumptions needed to address the question of the effect on the
health care costs due to defendants' alleged misconduct.

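The additive-versus-synergistic distinction can be made concrete with a two-factor sketch in Python. This is an illustration under assumptions: each factor is reduced to two levels (alleged act absent = 0, present = 1), the cell costs are invented, and the function name `decompose` is a hypothetical label. The interaction is the part of the joint effect not explained by the sum of the two separate effects.

```python
def decompose(cost):
    """Split the joint effect of two alleged acts into separate effects and interaction.

    cost maps (a, b) to the health-care cost in the world where act A is
    present (a = 1) or absent (a = 0), and likewise for act B.
    """
    baseline = cost[(0, 0)]              # world with neither alleged act
    effect_a = cost[(1, 0)] - baseline   # act A alone
    effect_b = cost[(0, 1)] - baseline   # act B alone
    joint = cost[(1, 1)] - baseline      # both acts together
    interaction = joint - effect_a - effect_b
    return effect_a, effect_b, joint, interaction


# Hypothetical cell costs (arbitrary units).
additive = {(0, 0): 100.0, (1, 0): 130.0, (0, 1): 120.0, (1, 1): 150.0}
synergistic = {(0, 0): 100.0, (1, 0): 130.0, (0, 1): 120.0, (1, 1): 170.0}

print(decompose(additive))     # interaction is zero: joint effect equals the sum
print(decompose(synergistic))  # positive interaction: joint effect exceeds the sum
```

When the interaction term is zero, the cost attributable to both acts together is just the sum of the costs attributable to each act; a nonzero interaction has to be posited and justified as a separate assumption.
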
Apart from any applicable legal considerations, the question of the effect of defendants'
alleged misconduct on the health-care expenditures of the plaintiffs, in principle, can be addressed
by statistical analyses of appropriate data coupled with explicit statistical assumptions. Any
statistical analysis that can address this question must distinguish between smoking behavior affected
and unaffected by that misconduct; take into account the passage of time; adjust reliably for
differences in background characteristics and other confounding factors between those exposed to
cigarette smoke and those not exposed to cigarette smoke; focus on the relevant population; explicate
assumptions; use valid statistical methods; and account for the timing and nature of defendants'
alleged misconduct (e.g., exclude any increased health-care costs due to smoking that occurred
before any alleged misconduct by defendants).


II. [...] TO ADJUST FOR BACKGROUND AND OTHER CONFOUNDING
DIFFERENCES BETWEEN SMOKERS AND NONSMOKERS.


Any analysis of smoking-attributable expenditures due to the mere existence of smoking or
due specifically to the defendants' alleged misconduct must properly adjust for differences in
covariates (background and other confounding variables) in a medical-expenditures model. All of the
[...] in this litigation simply rely upon studies performed by other researchers for such
adjustments. None of the studies relied upon by Dr. Dement or Dr. Harris that I have
reviewed properly adjusts for important differences between smokers and nonsmokers. Instead,
these studies utilize coarse categorical or regression techniques, which, based on my analysis of
differences in the distribution of covariates between smokers and nonsmokers, produce results that
do not reliably adjust for important covariate differences between smokers and nonsmokers. To
obtain reliable adjustments for covariates, it is important to combine regression modeling with
matching and/or subclassification using propensity scores.

Background on Adjusting for Covariates in Observational Studies 
The statistical literature warns that regression analysis cannot reliably adjust for differences 
in covariates when there are substantial differences in the distribution of these covariates in the two 
groups. 



http://legacy.library.ucsf.edu/fBd/ 0 oqOMpO/llM/vw industrydocuments.ucsf.edu/docs/hygl0001 


52299 4717 



For example, William G. Cochran, who served on the Advisory Committee that wrote the
1964 Surgeon General's Report, wrote extensively on methods for the analysis of observational
studies, as summarized in my chapter on his work on this topic in W.G. Cochran's Impact on
Statistics (Rao, 1984, John Wiley, New York). In 1957 Cochran wrote:

    "... when the x-variables show real differences among groups -- the
    case in which adjustment is needed most -- covariance adjustments
    [i.e., regression adjustments] involve a greater or less degree of
    extrapolation. To illustrate by an extreme case, suppose that we were
    adjusting for differences in parents' income in a comparison of
    private and public school children, and that the private-school
    incomes ranged from $10,000-$12,000, while the public-school
    incomes ranged from $4,000-$6,000. The covariance would adjust
    results so that they allegedly applied to a mean income of $8,000 in
    each group, although neither group has any observations in which
    incomes are at or even near this level." Cochran, William G.
    "Analysis of Covariance: Its Nature and Uses." Biometrics, Vol. 13,
    pp. 261-281.






    "If the original x-distributions diverge widely, none of the methods
    [e.g., regression adjustment] can be trusted to remove all, or nearly all,
    the bias. This discussion brings out the importance of finding
    comparison groups in which the initial differences among the
    distributions of the disturbing variables are small."

In the same article:

    "With several x-variables, the common practice is to compare the
    marginal distributions in the two groups for each x-variable
    separately. The above argument makes it clear, however, that if the
    form of the regression of y on the x's is unknown, identity of the
    whole multi-variate distribution is required for freedom from bias."
    Cochran, William G. "The planning of observational studies of
    human populations." The Journal of the Royal Statistical Society, A,
    Vol. 128, pp. 234-265.


In particular, there are three basic distributional conditions that in general practice must
simultaneously obtain for regression adjustment (whether by ordinary linear regression, linear
logistic regression or linear-log regression) to be trustworthy. If any of these conditions is not
satisfied, the differences between the distributions of covariates in the two groups must be regarded
as substantial, and regression adjustment is unreliable and cannot be trusted. These conditions are:

1. The difference in the means of the propensity scores in the two groups being
compared must be small (e.g., the means must be less than half a standard deviation apart), unless
the situation is benign in the sense that: (a) the distributions of the covariates in both groups are
symmetric, (b) the distributions of the covariates in both groups have nearly the same
variances, and (c) the sample sizes are approximately the same.

2. The ratio of the variances of the propensity score in the two groups must be close to
one (e.g., 1/2 or 2 are far too extreme).

3. The ratio of the variances of the residuals of the covariates after adjusting for the
propensity score must be close to one (e.g., 1/2 or 2 are far too extreme).
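
The three conditions translate into three summary statistics that can be computed once propensity scores are in hand. The Python sketch below is an illustration under stated assumptions rather than the computation used in this report: the scores are taken as given (for example, on the logit scale), the standardized difference divides by the square root of the average of the two group variances, residuals come from a pooled least-squares line of each covariate on the score, and the function names are hypothetical.

```python
from statistics import mean, variance


def _residuals(y, x):
    """Residuals from a least-squares line of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return [yi - my - slope * (xi - mx) for xi, yi in zip(x, y)]


def balance_diagnostics(score, covariates, treated):
    """Return (standardized mean difference, variance ratio, residual variance ratios).

    score: propensity score for each unit
    covariates: dict mapping covariate name -> list of values, one per unit
    treated: list of booleans, True for the first group (e.g. smokers)
    """
    s_t = [s for s, t in zip(score, treated) if t]
    s_c = [s for s, t in zip(score, treated) if not t]
    var_t, var_c = variance(s_t), variance(s_c)
    # Condition 1: difference in mean scores, in standard-deviation units.
    bias = (mean(s_t) - mean(s_c)) / ((var_t + var_c) / 2) ** 0.5
    # Condition 2: ratio of the variances of the scores.
    ratio = var_t / var_c
    # Condition 3: ratio of residual variances for each covariate.
    residual_ratios = {}
    for name, values in covariates.items():
        res = _residuals(values, score)
        r_t = [e for e, t in zip(res, treated) if t]
        r_c = [e for e, t in zip(res, treated) if not t]
        residual_ratios[name] = variance(r_t) / variance(r_c)
    return bias, ratio, residual_ratios


# Tiny made-up data set, for illustration only.
score = [0, 1, 2, 3, 1, 2, 3, 4]
treated = [False] * 4 + [True] * 4
covariates = {"x1": [1, 0, 2, 1, 3, 1, 2, 5]}
print(balance_diagnostics(score, covariates, treated))
```

In this made-up data set the two groups have identical score variances but means more than half a standard deviation apart, so the first condition would already fail.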

Specific calculations relevant to these points can be found, for example, in
Cochran and Rubin (1973) "Controlling Bias in Observational Studies: A Review." Sankhya, Series
A, Vol. 35, Part 4, pp. 417-446; Rubin (1973) "The Use of Matched Sampling and Regression
Adjustments to Remove Bias in Observational Studies," Biometrics, 29, pp. 185-203; and Rubin
(1979) "Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in
Observational Studies." Journal of the American Statistical Association, 74, pp. 318-328.

In particular, Cochran and Rubin (1973, p. 426) state that “linear regression on random
samples gives wildly erratic results ... sometimes markedly overcorrecting or even (with B = 1/4
for e^x) greatly increasing the original bias [when the ratio of the variances is one-half]”. Table 3.2.2
in that article implies that, when the ratio of the variances of any covariate is one half, regression can
grossly overcorrect for bias or grossly undercorrect for bias. Relevant results from that table are
summarized in Table 1 of my July 1998 Oklahoma Report, attached as Exhibit 1 to this report.
These three guidelines and Table 1 in that report also address regression adjustments on the logit or
linear log scale because they, too, rely on linear additive effects in the covariates (for discussion of
this point, see, e.g., Anderson, Auquier, Hauck, Oakes, Vandaele, and Weisberg (1980), Statistical
Methods for Comparative Studies, John Wiley, New York, p. 164).

B. Weighted Propensity Score Analyses Comparing Smokers and Never Smokers In The
National Population As Represented By NMES and NHIS

[...] how different smokers and never smokers might be in the national [...]
propensity score analyses using NMES (the National Medical [Expenditure Survey) ...]
in my January 1998 Report in Minnesota³ and in my July 1998 [...]
analyses, however, were unweighted to mirror what the plaintiffs’ [...]
below and in the electronic media I have produced in this case, I
compared smokers and nonsmokers by conducting a propensity score analysis using NMES and its
weights, thereby representing the national population corresponding to the NMES design. In that
analysis, I used the 32 covariates selected by the plaintiff’s expert, Dr. Harrison, in his Oklahoma [...]



3 The January 1998 Minnesota Report is attached here as Exhibit 2; the July 1998 Oklahoma
Report is attached as Exhibit 3.

Table 1: Propensity Score Analyses For Smokers Versus
Never Smokers In NMES and NHIS

Column headings: Smoking Group; Propensity Score (Bias, Variance Ratio); Residuals Orthogonal
to Propensity Score, Variance Ratios in Range (0, 1/2), (1/2, 4/5), (4/5, 5/4), (5/4, 2), (2, ∞).

Smoking group labels legible in the source: Current (NMES); Former.
Propensity score bias and variance ratio entries legible in the source: .81, .97, 1.24, .62, .41.

Residual variance ratio counts, listed by range in source order:
(0, 1/2):    0, 0, 15, 21, 13, 12
(1/2, 4/5):  6, 5, 14, 23, 25, 35
(4/5, 5/4):  22, 21, 61, 50, 73, 67
(5/4, 2):    3, 4, 30, 37, 20, 10
(2, ∞):      1, 2, 15, 11

I also conducted four propensity score analyses using the 1991 NHIS (the National Health
Interview Survey) including sampling weights and all of the covariates selected by the plaintiff’s
experts, the Cambridge Team, in the Massachusetts litigation,⁴ separately by gender, as in the
Team’s analysis. Those results are also in Table 1.

4 Drs. Cutler, Epstein, Frank, Hartman, King, and Newhouse, “The Impact of Smoking on
Medicaid Spending in Massachusetts 1970-1997,” June 15, 1998.

In all instances, the differences between smokers and nonsmokers nationally are substantial
enough that regression models cannot reliably adjust for even the background characteristics and
other confounding factors included in the regressions.

Thus, the linear regression methods used by the studies of the national population cited by
Dr. Harris cannot be considered as having adjusted in a reliable way for differences between smokers
and nonsmokers. Better procedures would combine regression methods and propensity score
subclassification, as suggested by the statistical literature.
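To make the idea of combining regression with propensity score subclassification concrete, the following is a minimal sketch under stated assumptions. It is not the procedure used in any report cited here. The function name, the choice of five subclasses formed from quantiles of the score, and the weighting of subclass estimates by subclass size are illustrative choices only.

```python
import numpy as np

def subclassified_regression_effect(y, treated, X, score, n_strata=5):
    """Average of within-subclass, regression-adjusted group differences.

    Subclasses are formed from quantiles of the propensity score. Inside each
    subclass, y is regressed on an intercept, the group indicator, and X.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    score = np.asarray(score, dtype=float)
    t = np.asarray(treated, dtype=float)

    cuts = np.quantile(score, np.linspace(0, 1, n_strata + 1)[1:-1])
    stratum = np.searchsorted(cuts, score, side="right")

    total, used = 0.0, 0
    for s in range(n_strata):
        idx = stratum == s
        n_s = int(idx.sum())
        # Skip subclasses lacking one of the two groups: no comparison is possible.
        if n_s == 0 or t[idx].min() == t[idx].max():
            continue
        design = np.column_stack([np.ones(n_s), t[idx], X[idx]])
        coef, *_ = np.linalg.lstsq(design, y[idx], rcond=None)
        total += n_s * coef[1]  # coefficient on the group indicator
        used += n_s
    if used == 0:
        raise ValueError("no subclass contains both groups")
    return total / used
```

The point of the design is that the regression is only asked to adjust for the small covariate differences remaining inside each subclass, rather than to extrapolate across groups whose covariate distributions differ substantially.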

C. Propensity Score Analyses Comparing Smokers and Never Smokers In the
Union Subset of NMES and NHIS

The differences between smokers and nonsmokers observed nationally appear to be even
more substantial in the union subsets of the NMES⁵ and NHIS⁶ data. These results using the
[...] are [...] presented in Table 2. Even if the experts for the plaintiff union trusts had
[...] or relied upon studies in the literature that examined a union
population, they must [make] reliable adjustments for differences between smokers and nonsmokers
on background characteristics and other confounding factors; linear regression adjustments are not
reliable in this setting.

5 I specifically looked at those individuals who reported that they had private insurance provided
by a union in any of the four rounds of interviews. To compute the propensity scores, I again used
the 32 variables in the Harrison Report in Oklahoma. Additional detail is found in electronic
materials accompanying my reports in this case.

6 I examined in NHIS the same population that Dr. Dement used in his expert report in the Ohio
union litigation as a surrogate for the union population when he estimated smoking prevalence
values. To compute the propensity scores, I used 34 of the variables employed by the Cambridge
Team’s June 15, 1998 Report in Massachusetts, excluding industry and occupation variables, and
collapsing some other indicator variables; additional detail is found in the electronic materials
accompanying my reports in this case.

Table 2: Propensity Score Analyses For Smokers Versus
Nonsmokers In Union Subsets of NMES and NHIS

Column headings: Smoking Group; Propensity Score (Bias, Variance Ratio); Residuals Orthogonal
to Propensity Score, Variance Ratios in Range (0, 1/2), (1/2, 4/5), (4/5, 5/4), (5/4, 2), (2, ∞).

Smoking group labels legible in the source: Current (NMES); Former; Current (NHIS); Former.
Propensity score bias and variance ratio entries legible in the source: .59, .69, .86, .11, .23, .76,
.95, .96.

Residual variance ratio counts, listed by range in source order:
(0, 1/2):    4, 5, 2, 0
(1/2, 4/5):  5, 7, 5, 6
(4/5, 5/4):  16, 16, 24, 20
(5/4, 2):    5, 4, 3, 5
(2, ∞):      0, 0, 3

D. Reliable Adjustments Combine Regression Modeling
and Propensity Score Methods

There is a quarter century of statistical literature (e.g., Cochran and Rubin, 1973) suggesting
that linear regression modeling alone is unreliable in the situations presented. This same literature
suggests, however, that combining regression modeling with matching and/or subclassification using
the propensity score is substantially more reliable. For example, in addition to the references cited
above, see [...] and Rubin and Thomas (1999). Moreover, based upon my review of a
literature search, propensity score methods are now being relatively widely referenced in highly
respected medical and statistical journals.

III. DR. HARRIS

A. General Comments

Dr. Harris makes no effort to estimate separately the effect of defendants’ alleged misconduct
on the behavior of the plaintiff trusts and, in turn, the effect of that change in trust-initiated smoking
behavior on the trusts’ health-care expenditures. Dr. Harris does present estimates of the effect of
the defendants’ alleged misconduct generally on the trusts’ health-care expenditures. His analyses,
however, have essentially none of the characteristics that are necessary to estimate the health-care
costs incurred by the trusts because of the existence of smoking or defendants’ alleged misconduct
generally. As detailed more fully below, Dr. Harris’ analyses fail to a) take into account the passage
of time in the manner I described; b) control properly for important background and other
confounding characteristics; c) distinguish between different types of smoking behaviors that can
lead to different health-care expenditure outcomes; d) focus on the participants in the plaintiff trusts;
e) disaggregate assumptions in a way that enables the assumptions to be assessed and varied
individually to see the consequences on the resulting estimates; f) consider the possibility that
smokers’ other health-related behaviors may also change in the counterfactual world; or g) employ
appropriate statistical methods or rely on the results of studies that do so.


The way Dr. Harris’ analyses attempt to estimate the effect of defendants’ alleged misconduct
generally on the trusts’ health-care expenditures is as follows. He first attempts to estimate the
proportion of smoking-attributable expenses that are due to the alleged misconduct. This estimation
has three parts: (1) general estimation of medical expenditures due to the existence of smoking
based on his review of selected literature; (2) his general subjective estimation of medical
expenditures that would have existed in a counterfactual world without the misconduct; and (3) the
calculation of the resulting proportionate differences, these being his estimated proportions of
estimated smoking-attributable expenses due to the alleged misconduct. Dr. Harris then applies
these estimated proportions to the estimated smoking-attributable expenses provided by Mr. Roberts
using the mortality-based smoking-attributable fractions estimated by Dr. Dement. All of the
analyses conducted by and relied on by Dr. Harris that I have examined are unreliable and
statistically invalid for this purpose.

B. The Existence of Smoking

To estimate the health-care costs incurred by the plaintiff trusts as a result of the existence
of cigarette smoking (as opposed to defendants’ alleged misconduct), one has to compare plaintiff
trusts’ health-care costs in the actual world to what those costs would have been in a counterfactual
world in which no cigarette smoking existed. To be reliable and valid, such an estimate must
manifest the same essential characteristics that I described above in Section I of this report.

Dr. Harris’ estimates do not have those characteristics. His estimates of these smoking-
attributable expenditures depend on two quantities: the relative expenditure ratios of ever smokers
compared to never smokers, or “r,” and the prevalence of ever smoking, or “p.” His actual world
values for “p” and “r” are unreliable.


Dr. Harris’ [...] ever smoking “p” values are based on a piece-wise linear model
without any provided statistical or economic justification. The “knot” value, which Dr. Harris places
at 1974, is not calculated from any particular data set, nor allowed to vary by background
characteristics and other confounding factors of individuals, but is simply postulated by Dr. Harris,
based on his subjective assessments. The line reflecting ever smoking prevalence from 1953 to 1974
does not appear to be a close match to the data points on which Dr. Harris relies, and there is no
confidence interval for his determinations, even when they appear to be based on the analysis of data
under particular assumptions.
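By way of contrast with a postulated knot, the sketch below shows one simple way a knot in a piece-wise linear model could be estimated from data, by least squares over a grid of candidate years. It is an illustration under stated assumptions, not Dr. Harris’ model and not an analysis performed in this report. The function name, the continuous two-segment form, and the grid search are hypothetical choices, and no real prevalence data are used.

```python
import numpy as np

def fit_one_knot(years, prevalence, candidate_knots):
    """Least-squares fit of a continuous two-segment line, knot chosen by grid search.

    Model: prevalence = a + b*(year - mean_year) + c*max(year - knot, 0).
    Returns (best_knot, coefficients_at_best_knot, sse_by_knot).
    """
    years = np.asarray(years, dtype=float)
    prevalence = np.asarray(prevalence, dtype=float)
    centre = years.mean()  # centering keeps the design matrix well conditioned
    best_knot, best_coef, best_sse = None, None, np.inf
    sse_by_knot = {}
    for k in candidate_knots:
        hinge = np.maximum(years - k, 0.0)
        design = np.column_stack([np.ones_like(years), years - centre, hinge])
        coef, *_ = np.linalg.lstsq(design, prevalence, rcond=None)
        resid = prevalence - design @ coef
        sse = float(resid @ resid)
        sse_by_knot[k] = sse
        if sse < best_sse:
            best_knot, best_coef, best_sse = k, coef, sse
    return best_knot, best_coef, sse_by_knot
```

The returned profile of residual sums of squares across candidate knots gives a rough sense of how sharply the data determine the knot, which is the kind of uncertainty statement that a postulated value cannot provide.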

In determining his values for “r,” Dr. Harris relies on a group of studies of health-care
expenditures. There are a variety of problems with those studies, illustrated below.

First, it is well known in many branches of statistics, epidemiology, and economics that it
is necessary to control for background and other confounding variables when trying to estimate
causal effects. These variables define subpopulations where, for example, similar smokers and
[nonsmokers] can be reasonably compared with respect to health outcomes, including health-care
expenditures. Even when interest focuses solely on aggregate costs and not on subpopulations,
background and other confounding variables that define subpopulations must be considered to obtain
valid estimates of aggregate costs.

Nevertheless, [...] of the studies relied on by Dr. Harris do not even attempt to control for
possibly important confounding variables; others attempt to do so but do not adjust for those factors
reliably. My review suggests that none of the studies relied on by Dr. Harris for his values of “r”
adjust for the full set of background characteristics that even the experts retained by the plaintiffs in
similar cases have adjusted for. See, e.g., L.S. Miller, X. Zhang, T. Novotny, D. P. Rice, and
W. Max, “State Estimates of Medicaid Expenditures Attributable to Cigarette Smoking, Fiscal Year
1993,” Public Health Reports 113:140-151 (March/April 1998) at 141-42; Minnesota Trial
Testimony of Dr. Samet at 3482, Deposition Testimony of Dr. Samet at 369-70, and Trial Testimony
[...] Wyant at 5372-73 and 5391-93; Oklahoma April 27, 1998 Report of Dr. G. W. Harrison at
18-20 and 39; Vincent Miller, et al., Smoking Attributable Medicaid Care Costs: Models and
Results, September 3, 1997, Tables 2.4-2.8 at 23-27 and 29-30; David M. Cutler, et al., The Impact
of Smoking On Medicaid Spending In Massachusetts: 1970-1997, June 15, 1998, Table IV.2.⁷


Moreover, none of these studies relied upon by Dr. Harris that I have examined reliably
adjusts even for the background characteristics they purport to adjust for. For example, as
documented in my January 1998 Supplemental Report in Minnesota, in my July 1998 Report in
Oklahoma, my November 20, 1998 Supplemental Report in the Ohio Union litigation, and in
Section II in this report, the differences between smokers and nonsmokers in NMES and NHIS are
so substantial that the attempted adjustments using regression models are known to be entirely
unreliable. Furthermore, such differences persist even within the union-worker subset of NMES
and NHIS as shown in Section II of this report.

Second, none of the studies on which Dr. Harris relies explicitly considers a union trust fund
participant population, much less a population of union trust fund participants in the Northwest United
States. Dr. Harris implicitly appears to assume that relative expenditure ratios of ever smokers
compared to never smokers computed based on the general U.S. population or populations of
employees of a given firm apply equally to the participants in the plaintiff trusts, even though the
participants may differ in important ways from these other populations. This is despite the fact
that, in deposition, Dr. Harris allows, as an explanation for empirical results that counter one of his
theories, that Ohio union members may differ in important ways from national union members. See
Harris Deposition (Ohio), December 15, 1998 at 86-89. Similarly, the ratio of medical expenditures
between nonsmokers and smokers for the participants in the plaintiff trusts may differ from that of
the national general population or of employees of a given firm. Yet, Dr. Harris’ analysis provides
no evaluation of whether the values of “r” he uses are applicable to the participants in the plaintiff
union trusts. Section IV of this report documents the substantial differences between union workers
and the rest of the national population as represented in NMES and in NHIS.

7 Ideally, any characteristic that might be considered important should be used because a
subsequent analysis can always average over that characteristic, thereby ignoring it, whereas an
analysis that leaves it out initially has that characteristic confounded with smoking behavior.

Third, based on my review, many of these studies relied on by Dr. Harris confront missing
data problems, but, as I have described in my reports in other cases (see Exhibits 2 and 3), they use
techniques documented in the statistical literature as being flawed. See, e.g., Little and Rubin,
Statistical Analysis with Missing Data (John Wiley, NY 1987); Panel on Incomplete Data, National
Academy of Sciences, Incomplete Data in Sample Surveys, Vol. 1-3 (Academic Press, NY 1983).
For example, [many] of the studies relied on by Dr. Harris impute values for the data that are
missing and use the resulting data sets as if the imputed values were real data. But, in the instances
I have examined, they do not use sound statistical methods to impute these missing data, and,
instead, use invalid procedures. Among the errors made by the studies relied on by Dr.
Harris for his values of “r” are: using best predicted values for missing data, imputing the missing
data sequentially without proper conditioning, imputing the missing data using only part of the
known data, imputing the missing data without taking into account their uncertainty,
relying on strong unsubstantiated assumptions, using arbitrary values to replace missing values, and
failing to conduct any informative sensitivity analyses of the imputations, such as varying
fundamental assumptions underlying the imputations. As I have described in detail in other reports,
these kinds of errors mean that the data analyses and conclusions about “r” values relied on by Dr.
Harris are statistically unreliable and invalid.

Fourth, Dr. Harris’ analyses of both “p” and “r” merely compare never and ever smokers (as
a weighted average of former and current smokers). Differences in smoking behaviors that
epidemiologists, physicians, and even Dr. Harris⁸ appear to believe are important (e.g., smoking
intensity, smoking duration, type of cigarette smoked, or time since quitting) are nowhere
accounted for in his analyses of either “p” or “r.” To be valid, such differences, as discussed above,
must be taken into account in appropriate subgroups defined by demographic characteristics and
health-related behaviors.


Finally, Dr. Harris’ analyses do not take into account the passage of time, consider the
possibility that other health-related behaviors (e.g., overeating or illicit drug behavior) may be
affected by the existence of smoking, or explicate necessary assumptions in a manner that allows
them to be identified and evaluated or altered to determine the effects on the results.

C. The Effect of Defendants’ Alleged Misconduct Generally

Dr. Harris does not attempt to assess separately the effect of defendants’ alleged misconduct
on the trusts’ behavior specifically (two-component behavioral model).

In his December [...], 1998 report in this case, Dr. Harris does attempt to estimate the effect
of defendants’ alleged misconduct generally on the trusts’ health-care expenditures (one-component
behavioral model). He presents results from his one-component analysis. Although Dr. Harris
properly recognizes the need to consider separately the effect of defendants’ alleged misconduct on
smoking behavior and the effect of that change in smoking behavior on the trusts’ expenditures, the
estimates he presents are neither reliable nor statistically valid.


8 See, e.g., Harris Oct. 26, 1998, “Notes on Changes Over Time in the Mean Duration of Quitting
Among Former Smokers; Impact on 0, the Reduction in the Excess Risk of Quitting.” See also
undated original report of Dr. Lauer in the Ohio Union litigation, entitled “Smoking and Coronary
Health Disease” at 12-14, and his supplemental October 22, 1998 report, “Smoking and Stroke” at 4.

Dr. Harris’ estimates, as an initial matter, critically depend upon his estimates of the actual
world values of “p” and “r,” and his resulting estimates of the smoking-attributable fraction of
health-care costs due to the very existence of smoking. Because those estimates, as discussed above,
are unreliable and invalid, so, too, are Dr. Harris’ estimates of the effect of defendants’ alleged
misconduct, even if his assessment of the change in smoking behavior of the trusts’ participants and
the consequent change in the health-care expenditures of the trusts in the counterfactual world were
reliable and valid.
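The dependence of a smoking-attributable fraction on “p” and “r” can be illustrated with the standard epidemiological attributable-fraction formula, p(r - 1) / (1 + p(r - 1)). This report does not itself state the formula Dr. Harris uses, so the formula, the function name, and the input values below are assumptions introduced only to show how errors in “p” or “r” carry through to the resulting fraction.

```python
def smoking_attributable_fraction(p, r):
    """Attributable fraction for exposure prevalence p and relative ratio r."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    if r <= 0.0:
        raise ValueError("r must be positive")
    excess = p * (r - 1.0)
    return excess / (1.0 + excess)

# Hypothetical inputs, chosen only to show how the fraction moves with p and r.
for p in (0.4, 0.5, 0.6):
    for r in (1.1, 1.3, 1.5):
        print(f"p={p:.1f} r={r:.1f} SAF={smoking_attributable_fraction(p, r):.3f}")
```

Under this assumed formula the fraction is zero when r equals one and rises with both inputs, so any unreliability in either input is passed directly to every dollar figure derived from it.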


Dr. Harris’ further analyses of the effect of the defendants’ alleged misconduct on smoking
behavior and health-care costs in the counterfactual world, additionally, are neither reliable nor
valid. Those analyses have essentially none of the characteristics that are necessary to
estimate the health-care costs incurred by the trusts because of defendants’ alleged misconduct. As
with his actual world estimates, Dr. Harris fails to: a) take into account the passage of time in the
manner I described; b) control properly for important background and other confounding
characteristics; c) distinguish between different types of smoking behaviors that can lead to different
health-care expenditure outcomes; d) focus on the participants in the plaintiff trusts; e) disaggregate
assumptions in a way that enables the assumptions to be assessed and varied individually to see the
consequences on the resulting estimates; f) consider the possibility that smokers’ other health-related
behaviors may also change in the counterfactual world; or g) employ appropriate statistical methods.

In the end, Dr. Harris’ consideration of the effect of defendants’ alleged misconduct is really
a subjective statement based on his opinions of what would have happened in a counterfactual world
without that alleged misconduct. Two illustrations help demonstrate the point.

First, Dr. Harris appears to conclude that, in the absence of the alleged misconduct, the
tobacco companies would have introduced “safer” products to the marketplace. By this he appears
to mean both that products that never were introduced would have been introduced and would have
been adopted by consumers, and that the “safer” products factually introduced would have been
introduced sooner. But in his report in this case, Dr. Harris never specifies: how much “safer” any
“safer” product would have been than products already on the market at that time; the rate at which
consumers, and participants in the plaintiff trusts in particular, would have adopted them; or what
the effects of switching, if any, to any of these “safer” cigarettes would have been on the trusts’
health-care expenditures over time. Nor could he do so on deposition. See, e.g., Harris Washington
Dep. at 396, 398, and [...]; Harris Arizona Dep. at 18, 20, 297, 322-25, and 346-52.


Second, Dr. Harris concludes that, in a counterfactual world without the alleged
misconduct, the tobacco industry would have disseminated information about the health risks of
smoking and this information would have modified the behavior of those who were smokers in the
actual world. It appears, however, that many factors affect consumers’ decisions to begin to smoke,
to quit smoking, or to switch to “safer” cigarettes. According to the Surgeon General, smoking
initiation is affected by parental smoking, peer influence, and a variety of other psycho-social risk
[factors]. U.S. DHHS, Preventing Tobacco Use Among Young People: A Report of the Surgeon
General, Ch. 4 (1994); see also Harris Washington Dep. at 25-36. Similarly, a number of factors
reportedly affect consumer choices among different cigarette brands that, by some measures, pose
different levels of health risks, as well as decisions by existing smokers to continue or to quit
smoking. See, e.g., Harris Washington Dep. 36-45; U.S. DHHS, The Health Consequences of
Smoking: The Changing Cigarette, a Report of The Surgeon General (1981) at 5-6; Redmond, W.H.,
“Product Disadoption: Quitting Smoking as a Diffusion Process,” J. of Pub. Policy & Marketing
15(1):87-97 (Spring 1996) at 87, citing U.S. DHHS, Reducing the Health Consequences of Smoking:
25 Years of Progress (1989). Yet, Dr. Harris incorporates no methodology to quantify the role those
factors play in consumer choices. He consequently cannot reliably assert the role that the industry’s
alleged misconduct played in consumer choices.

For instance, Dr. Harris obtains his counterfactual world values for “p” by shifting back his
“knot” value or peak for ever-smoking prevalence from his factual world estimate of 1974 to 1955,
[...] 1965, without [...] justification whatsoever. See, e.g., Harris Washington Dep.
[...] 57.


IV. ESTIMATES OF PREVALENCE OR RELATIVE RISK BASED ON THE
NATIONAL POPULATION ARE UNRELIABLE ESTIMATES FOR THE UNION
POPULATION

Dr. Harris relies on estimates of relative risk from the literature that were derived using
national data or data from particular firms, and he, moreover, estimates smoking prevalence values
based on national data. [Thus], he implicitly assumes a) that the quantitative impact of smoking on
health-care expenditures is the same in the national population or for employees of selected firms
as among individuals who participate in health-care programs funded by union trust funds; and
b) that smoking prevalence estimates using national data apply equally to the participants in the
plaintiff union trusts.

Despite Dr. Harris’ deposition testimony that “... in an ideal world - one would really
look specifically at the population at hand ...,” nowhere does Dr. Harris assess the validity
of these assumptions. Harris Maryland Dep. at 66-67 (discussing the African-American population).
I have prepared propensity score analyses comparing the union populations represented in the NMES
and NHIS samples to the rest of the population as represented in those samples, using their sampling
weights. The results are in Table 3 using the 43 variables in the Harrison Report in Oklahoma for
NMES, and using 40 of the covariates plus two indicators for current and former smoking from the
Cambridge Team’s June 15 Massachusetts report for NHIS, excluding industry and occupation
indicators.⁹ In both instances, the analyses revealed that there are substantial differences between
the union population and the rest of the nation on the background characteristics and other
confounding factors that other plaintiffs’ experts in litigation against these defendants believe are
important. The differences are so substantial that estimates of relative expenditure risk ratios or
smoking prevalence values based on national data cannot reliably be applied to the union population
generally, much less to the participants of the plaintiff Northwest Laborers union trusts.

[...] Dr. Dement’s estimate of the relative mortality risk of smokers for particular diseases
[...] is based on the American Cancer Society’s Cancer Prevention Survey II dataset, which is
[...] of either the national population or the union population in the United States. I
[...] to conduct propensity score analyses to reveal the size of the differences between these
populations and the ACS CPS II dataset.

Moreover, it has been documented that smoking trends in the blue-collar population are
different than in the rest of the population, typically declining more slowly in time than in the
general population. See, e.g., Nelson, et al., “Cigarette Smoking Prevalence by Occupation in the
United States,” J. Occup. Med., Vol. 36, No. 5, May 1994, pp. 516-25.


9 The variables exclude occupation and industry indicators, and collapse some other indicators into
coarser categories; a complete list of the variables used in these propensity score analyses is found
in the electronic materials accompanying this report.

Table 3: Propensity Score Analyses For Union Versus Non-Union
Populations as Represented in NMES and NHIS

                                  Residuals Orthogonal to Propensity Score
          Propensity Score               Variance Ratios in Range
Data                Variance
Source    Bias      Ratio      (0, 1/2)  (1/2, 4/5)  (4/5, 5/4)  (5/4, 2)  (2, ∞)
NMES       .80       .81          4          4           30          4        1
NHIS      1.19       .31          7          5           23          5        0

V. DR. DEMENT AND MR. ROBERTS

Neither Dr. Dement nor Mr. Roberts attempts to quantify the impact of defendants’ alleged
misconduct on the trusts’ health-care expenditures at all, either generally (one-component behavioral
model) or specifically [through] changes in the trusts’ behavior (two-component behavioral model).

Instead, they exclusively attempt to address a causal question different from one involving
defendants’ alleged misconduct: What health-care expenditures were incurred by the plaintiff trusts
as a result of the existence of cigarette smoking, regardless of defendants’ alleged wrongdoing? As
a result of attempting to address this other question, their estimates of smoking-attributable
expenditures, even if all of the errors in their underlying analyses were eliminated, would still
substantially overstate the health-care expenditures due to defendants’ alleged wrongful conduct.
This fact is reflected in Dr. Harris’ allocating only a portion of the dollar amounts from the Dement
and Roberts reports to the alleged misconduct.

The analyses of Dr. Dement and Mr. Roberts, moreover, fail to estimate reliably and validly
the health-care costs incurred by the plaintiff trusts due to the existence of smoking because those
analyses do not have the essential characteristics that I described above in the first section of this
report as necessary to generate reliable and statistically valid results:

1. Dr. Dement and Mr. Roberts do not even model medical
expenditures, but rather disease mortality.

2. Their analyses do not attempt to control adequately for background
and other confounding variables, other than gender and, in some
instances, broad age groups.

3. Their reports distinguish only between current, former, and never
smoking, disregarding differences in these categories of smoking
behavior that physicians and epidemiologists believe are important.

4. Their calculations fail to address the passage of time in the manner I
described above.

5. Their calculations depend upon relative mortality estimates and
smoking prevalence estimates that are derived from samples that are
not representative of the populations covered by the trusts.

6. Their [reports] leave critical assumptions implicit and not evaluated.
For example, nowhere in their reports do Dement and Roberts
scientifically evaluate their assumption that the relative risk for
mortality is the same as the relative risk for expenditures.

7. [Nowhere do] their calculations consider the possibility that other health-
related behaviors (e.g., overeating, illicit drug behavior) may be
affected by the existence of smoking.

8. Missing data problems are severe but dealt with in ad hoc and
inappropriate ways. For at least a dozen years (see, e.g., Little and
Rubin, 1987), there have been principled statistical approaches to
missing data, but none of these are used or even referenced. For
example:

   a. Missing ICD-9 codes (unspecified claim records) are effectively imputed
   using a “best prediction” scheme based at most only on coarse age, gender
   and status (participant/dependent), and moreover, no account is made for the
   uncertainty when applying the resultant composite SAFs to medical
   expenditures.

   b. The vast majority of the medical claims data are missing and are effectively
   imputed using an entirely ad hoc methodology. There are only four named
   but 57 absent class members, and the only “claims” data from the absent
   members are from 5500 records, which do not report medical expenditures.
   Moreover, even for the four named members, most of the data are missing for
   the time period under consideration. Not only are “predictions” created using
   at most only coarse age, gender, status and disease type, but nowhere is there
   any reflection of the vast uncertainty in the resulting estimated expenditures.

   c. Multiple imputation (Rubin, 1987) is a principled and practical way to
   address missing data in many contexts. An effort to create a
   multiply-imputed version of NMES is proceeding under my direction. When
   completed, it should be able to support valid statistical inferences for
   estimated expenditure models.
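For readers unfamiliar with how a multiply-imputed data set supports inference, the following is a minimal sketch of the standard combining rules associated with Rubin (1987). It is illustrative only and is not drawn from the NMES imputation effort described above. The function name is hypothetical, and the inputs are assumed to be a scalar estimate and its sampling variance from each completed data set.

```python
import math

def combine_multiple_imputations(estimates, variances):
    """Combine m completed-data analyses of a scalar quantity.

    estimates : point estimates, one per imputed data set
    variances : their estimated sampling variances (squared standard errors)
    Returns (pooled_estimate, total_variance, degrees_of_freedom).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates with matching variances")
    q_bar = sum(estimates) / m                              # pooled estimate
    u_bar = sum(variances) / m                              # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    t = u_bar + (1.0 + 1.0 / m) * b                         # total variance
    if b == 0.0:
        df = math.inf
    else:
        df = (m - 1) * (1.0 + u_bar / ((1.0 + 1.0 / m) * b)) ** 2
    return q_bar, t, df
```

The between-imputation term is what a single “best prediction” imputation omits: it carries the uncertainty about the missing values into the final standard error.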

CONCLUSION

The plaintiffs’ experts’ analyses do not even attempt to generate any estimates of the trusts’
[excess expenditures that were] caused by the defendants’ alleged misconduct on the behavior of the
trusts (two-component behavioral model).

Neither Dr. Dement nor Mr. Roberts even attempt to generate any estimates of the trusts’
excess expenditures that were caused by the defendants’ alleged misconduct generally (one-
[component behavioral model)]. Dr. Harris properly recognizes the need to consider separately the
effect [of the alleged] misconduct on smoking behavior and the effect of that change in
smoking behavior on the trusts’ health-care costs. He attempts to estimate the effect of defendants’
alleged misconduct generally on the trusts’ expenditures, but his estimates are neither reliable nor
statistically valid. His estimation is based on first estimating the proportions of health-care
expenditures from the existence of smoking that are due to the alleged misconduct of the tobacco
industry and, second, applying these proportions to the expenditure amounts provided by Dement
and Roberts.

But the Dement and Roberts’ analyses, which attempt to estimate the medical expenditures
of the individuals in the trusts due to the mere existence of smoking, incorporate essentially none
of the characteristics that are critical to obtaining scientific, reliable, and statistically valid estimates
and so are neither reliable nor statistically valid.


I am continuing to address the issues surrounding the plaintiffs' expert reports, and therefore may supplement my report in the future.

SECOND SUPPLEMENTAL REPORT OF DONALD B. RUBIN
NORTHWEST LABORERS


My qualifications and experience were detailed in my two prior reports in this case: my September 16, 1998 initial report and my February 23, 1999 first supplemental report.

[...] of the data sets upon which Dr. Dement, Mr. Roberts, and Dr. Harris rely is from a union trust fund recipient population, much less a population representative of the participants in the Washington State [...] trust funds.

[...] prior reports used propensity score analyses to assess the differences in the distributions [...] pertinent background characteristics (a) between smokers and nonsmokers within the data sets relied upon [...] experts; and (b) between the individuals represented in those [...] sets [...] of national union health care fund participants (represented in NMES) or a [...] that plaintiffs' experts believe is a surrogate for a national population of union health-care fund [...] (represented in NHIS). This report extends those analyses.

Specifically, plaintiffs' experts implicitly appear to assume that the relative mortality risks, and, by further assumption, relative expenditure ratios, of smokers compared to non-smokers, will be the same in the American Cancer Society's Cancer Prevention Survey II database ("ACS CPS-II") as they are among the participants in the plaintiff trust funds. I have now conducted propensity score analyses using a few variables comparing the ACS CPS-II data base to both the national population as a whole and within certain sub-populations defined by smoking status as represented by the NMES and NHIS probability surveys. The results of those propensity score analyses are summarized in Table 1; the computer code that generates those results together with the output are contained in the electronic materials accompanying this report.


Although I used only a few variables, the ACS CPS-II data are not representative of the nation, even in sub-populations defined by smoking status, such as male never smokers. There are substantial differences between the ACS CPS-II data set and the national samples: Typically, there are [...] of well over a standard deviation along the propensity score (bias > 1), with substantially less variability (both along the propensity score and in the residuals) in the ACS CPS-II data set than in the nation. That is, ACS CPS-II is not representative of the nation as a whole [...] is much more homogeneous than the nation.
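The two summary quantities reported along the propensity score (a bias in standard-deviation units and a variance ratio) can be sketched as follows. The report does not spell out its exact formulas, so this is an assumption: the bias is taken to be the difference in mean scores divided by the square root of the average of the two sample variances, and the variance ratio is the first sample's variance over the second's. The scores themselves are hypothetical toy values, assumed to have been estimated already.

```python
from math import sqrt
from statistics import mean, variance


def propensity_bias_and_variance_ratio(scores_a, scores_b):
    """Standardized difference and variance ratio of two sets of scores.

    Assumed definitions (not confirmed by the report):
      bias  = (mean_a - mean_b) / sqrt((var_a + var_b) / 2)
      ratio = var_a / var_b
    """
    var_a = variance(scores_a)
    var_b = variance(scores_b)
    pooled_sd = sqrt((var_a + var_b) / 2.0)
    bias = (mean(scores_a) - mean(scores_b)) / pooled_sd
    return bias, var_a / var_b


# Hypothetical propensity scores: a tightly clustered sample and a more
# dispersed reference sample.
toy_sample_scores = [1.0, 1.2, 0.8, 1.1, 0.9]
toy_reference_scores = [-0.5, 0.5, -1.5, 1.5, 0.0]
bias, ratio = propensity_bias_and_variance_ratio(toy_sample_scores, toy_reference_scores)
```

In this toy case the bias exceeds 1 and the variance ratio is far below 1, the same qualitative pattern the report describes: a large shift along the score together with much less variability in the first sample.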

Additionally, I have provided in Exhibit 1 to this report some further publications in the current epidemiological and [...] literature that refer to propensity score methods. Propensity score [...] are becoming increasingly popular in medicine, epidemiology, and other applied [...] as economics, not only for assessing biases between data sets, as in Table 1, but also for [...] more reliable adjustments between groups (such as smokers and non-smokers).

Finally, as I advised in my first supplemental report, an effort to create a multiply-imputed version [...] to address its missing data is proceeding under my direction. Due to an [...] and unanticipated coding error, the multiply-imputed NMES data are not yet available. That work is proceeding, however, and, provided no additional unforeseen difficulties arise, I anticipate that multiply-imputed NMES data will be available in the next few weeks.

I may also further supplement this report as appropriate to respond to the expected


supplemental submissions of plaintiffs' experts.

Table 1: Propensity Analyses Comparing Subjects in CPS-II with National Representative
Samples NMES, NHIS-80, NHIS-82, and NHIS-91: Overall and by Smoking Status


Comparison Group


Propensity Score


Bias


Variance

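Males
CPS-II vs [...]
CPS-II vs [...]
CPS-II vs [...]
CPS-II vs [...]

Females
CPS-II vs [...]
CPS-II vs [...]
CPS-II vs [...]
CPS-II vs [...]

Males, Current Smoker
CPS-II vs [...]
CPS-II vs [...]
CPS-II vs [...]

Males, Former Smoker
CPS-II vs [...]
CPS-II vs [...]
CPS-II vs [...]

Males, Never Smoker
CPS-II vs [...]
CPS-II vs [...]
CPS-II vs [...]

Females, Current Smoker
CPS-II vs NMES
CPS-II vs [...]
CPS-II vs [...]

Females, Former Smoker
CPS-II vs [...]
CPS-II vs [...]
CPS-II vs NHIS-[...]

Females, Never Smoker
CPS-II vs [...]
CPS-II vs [...]
CPS-II vs NHIS-91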



1.30 

1.39 

US 

1.34 



1.16 

0.86


1.07 

1.20 

1.11


0.32 

0.33 

0.39 

0.41 


0.48 

0.51 

0.56 

0.66 


0.48 

0.46 

0.51 


0.42 
0.51
0.54 


0.27 

0.27 

0.43 


0.53 

0.57 

0.67 


0.61 

0.58 

0.92 


0.43 

0.48 

0.68 


Variables Used in this Analysis:

Age, Race, Marital status, Education, Body Mass Index,
High blood pressure,* and Diabetes*

*Not used in NHIS-80 comparison


Residual orthogonal to propensity score
(0, 1/2) (1/2, 4/5) (4/5, 5/4) (5/4, 2) [...]


4 

1 

3 

4 


3 

4 

3 

4 


4 2 0 
2 2 0 
3 1 0
1 5 0 


1 5 6
1 4 3
1 3 4
1 2 7


1 

3 
2 

4 


0 

0 

0 

0 


2 

2 

4 


5 

4 

3 


1 

1 

1 


3 

2 

2 


0 

0 

2 


4 

4 

5 


3 

2 

2 


3 1 0 
2 1 0 
1 4 0 


3 

3 

4 

0 

0 

0 


1 

1

0 


4 

3 


6 

4 

5 


7 

4 

4 


2 

2 

1 

3 

3 

3 


2 

2 

5 


2 

1 

2 


2 

2 

4 


1

2 

3 


0 

0

1 

0 

0 

0 

0 

0 

0 


1 

1

1 


3 

2 

3 


7 

5 

5 


0 

1 

3 


0 

0 

0 


J. WILLIAMS


Estimating Exposure Effects by Modelling the Expectation of
Exposure Conditional on Confounders


James M. Robins and Steven D. Mark

Harvard School of Public Health, 665 Huntington Avenue,
Boston, Massachusetts 02115, U.S.A.

and

Whitney K. Newey

Department of Economics, Massachusetts Institute of Technology,
Cambridge, Massachusetts 02139, U.S.A.


SUMMARY

In order to estimate the causal effects of one or more exposures or treatments on an outcome of interest, one has to account for the effect of "confounding factors" which both covary with the exposures or treatments and are independent predictors of the outcome. In this paper we present regression methods which, in contrast to standard methods, adjust for the confounding effect of multiple continuous or discrete covariates by modelling the conditional expectation of the exposures or treatments given the confounders. In the special case of a univariate dichotomous exposure or treatment, this conditional expectation is identical to what Rosenbaum and Rubin have called the propensity score. They have also proposed methods to estimate causal effects by modelling the propensity score. Our methods generalize those of Rosenbaum and Rubin in several ways. First, our approach straightforwardly allows for multivariate exposures or treatments, each of which may be continuous, ordinal, or discrete. Second, even in the case of a single dichotomous exposure, our approach does not require subclassification or matching on the propensity score so that the potential for "residual confounding," i.e., bias, due to incomplete matching is avoided. Third, our approach allows a rather general formalization of the idea that it is better to use the "estimated propensity score" than the true propensity score even when the true score is known. The additional power of our approach derives from the fact that we assume the causal effects of the exposures or treatments can be described by the parametric component of a semiparametric regression model. To illustrate our methods, we reanalyze the effect of current cigarette smoking on the level of forced expiratory volume in one second in a cohort of 2,713 adult white males. We compare the results with those obtained using standard methods.

1. Introduction
1.1 The Problem

In order to estimate the causal effect of one or more exposures or treatments on an outcome of interest, one has to account for the effect of "confounding factors" which both covary with the exposures or treatments and are independent predictors of the outcome. If few in number, categorical confounding factors are commonly dealt with by stratification. When there are many confounding factors or when some of the factors are continuous, regression methods are used. In this paper we present regression methods which, in contrast to standard methods, adjust for confounding by modelling aspects of the marginal association of the exposures of interest with the confounders rather than by modelling the independent association of the confounders with the outcome. Specifically, we will model the conditional expectation of the exposures given the confounders. These methods of estimation will be particularly useful when prior knowledge regarding the association of the confounders with exposure status is more precise than knowledge regarding their association with the outcome.

Key words: Causal inference; Covariance adjustment; Epidemiologic methods; Propensity score; Semiparametric efficiency; Semiparametric regression.

For concreteness, we shall attempt to estimate the effect of being a current cigarette smoker on the level of forced expiratory volume in one second (FEV1) in a cohort of 2,713 adult white male former and current cigarette smokers from the initial cross-sectional data collected in the Harvard Six Cities Study (Dockery et al., 1988). We shall estimate this effect while adjusting for the presence of the 22 potential confounding factors listed in Table 1 that include past smoking history, past respiratory symptoms, age, height, and coexistent heart disease. In this example the exposure of interest is dichotomous and we assume that there is no interaction between that exposure and the confounders. That is, we assume that the effect of current smoking on FEV1 does not depend on a subject's age, weight, previous smoking history, etc. In this setting the most common approach to estimating the effect of current smoking on FEV1 would be to postulate a linear regression model

$$Y_i = \beta_1 + \beta S_i + \sum_{k=2}^{K} \beta_k X_{ki} + \varepsilon_i, \qquad E[\varepsilon_i \mid S_i, X_i] = 0, \tag{1}$$

where $Y_i$, $S_i$, $X_i = (X_{1i}, \ldots, X_{Ki})$ are respectively random variables representing subject $i$'s FEV1 level, current smoking status ($S_i = 1$ if a current smoker and $S_i = 0$ otherwise), and values on a vector $X_i$ of potential confounding factors. Note that the parameter of interest $\beta$ is distinguished from the "nuisance" parameters $(\beta_1, \ldots, \beta_K)$ by the absence of a subscript. For notational simplicity, we shall assume that $(Y_i, S_i, X_i)$ are independent and identically distributed random vectors, although, with minor modifications, our results will hold if the $X_i$ are fixed constants and the $(\varepsilon_i, S_i)$ are independent across subjects.

Define $\sigma^2(S_i, X_i) = \mathrm{var}[\varepsilon_i \mid S_i, X_i]$. We write $\sigma^2(S, X) = \sigma^2$ if the errors are homoscedastic. Unless stated otherwise, we shall assume homoscedastic errors, although we do not assume that this fact is known to the data analyst. The $\varepsilon_i$ are not assumed to be independent of the $(S_i, X_i)$.

Suppose we are unwilling to assume that the independent association of the confounders $X_i$ with the outcome $Y_i$ has a known functional form. In that case, we would generalize model (1) to

$$Y_i = \beta S_i + h(X_i) + \varepsilon_i, \qquad E[\varepsilon_i \mid S_i, X_i] = 0, \tag{2}$$

where $h(X_i)$ is an unknown real-valued function of the vector $X_i$. Model (2) has a

Table 1

Twenty-two potential confounders of the effect of current smoking on FEV1

Age
Age-squared
Height
Body mass index
Chronic cough
Recurrent bouts of coughing
History of treatment for heart disease
Chronic phlegm production
Chronic wheeze
Total years of cigarette smoking
Lifetime pack-years smoked
History of emphysema
Past history of asthma
Current asthma
Former cigar smoker
Current cigar smoker level = hi
Current cigar smoker level = medium
Current cigar smoker level = lo
Former pipe smoker
Current pipe smoker level = hi
Current pipe smoker level = medium
Current pipe smoker level = lo

semiparametric regression function with parametric component $\beta S_i$ and nonparametric component $h(X_i)$. This paper is concerned with the estimation of $\beta$ from model (2). Robinson (1985) has provided an asymptotically normal and unbiased estimator of $\beta$ under a large-sample limiting model in which the number of confounding factors remains fixed as the sample size grows. His estimator relies on the fact that, under such a limiting model, the unknown function $h(X_i)$ can be consistently estimated by nonparametric regression techniques. In epidemiologic research, the number of confounding variables can be quite large. In these instances, the more appropriate limiting model would be one in which we allow the number of confounding factors contained in $X_i$ to increase with the sample size (Huber, 1981).

It is difficult to generalize Robinson's approach based on nonparametric estimation of $h(X_i)$ when the dimension of $X_i$ is large. As a consequence, to obtain consistent estimators of $\beta$, we shall consider making additional a priori assumptions beyond those specified by model (2). The standard approach would be to assume that $h(X_i)$ is known a priori except for a finite number of unknown parameters. As an example, the linear regression model assumes that


$$h(X_i) = \beta_1 + \sum_{k=2}^{K} \beta_k X_{ki}.$$

In contrast to the standard approach, in this paper we shall suppose that prior information concerning the marginal association of $S_i$ with $X_i$ is sharper than that concerning the form of $h(X_i)$. Thus we shall leave $h(X_i)$ completely unspecified and instead specify parametric models for the marginal association of $S_i$ and $X_i$. Specifically, we shall consider parametric models for $p(S = 1 \mid X_i)$ such as the logistic regression model

$$p[S = 1 \mid X_i; \alpha] = \frac{\exp\left(\alpha_1 + \sum_{k=2}^{K} \alpha_k X_{ki}\right)}{1 + \exp\left(\alpha_1 + \sum_{k=2}^{K} \alpha_k X_{ki}\right)}. \tag{3}$$

Here $\alpha = (\alpha_1, \ldots, \alpha_K)$. We shall show that we can obtain asymptotically normal and unbiased estimators of $\beta$ in model (2) provided our model (3) for $p(S = 1 \mid X_i)$ is correctly specified.

Although correctly specified parametric models for either $h(X_i)$ or $p(S = 1 \mid X_i)$ will provide asymptotically normal and unbiased estimates of $\beta$, nonetheless, as discussed in the next paragraph, least squares estimators of $\beta$ based on models for $h(X_i)$ will always be at least as efficient as any estimator of $\beta$ based on models for $p(S = 1 \mid X_i)$. This suggests that, for reasons of efficiency, it is always preferable to model $h(X_i)$ rather than $p(S = 1 \mid X_i)$. But if, as we assume in this paper, our prior information concerning $h(X_i)$ is less sharp than that concerning $p(S = 1 \mid X_i)$, we would choose not to model $h(X_i)$ in order to protect against specification bias.

In order to explain why the ordinary least squares estimator of $\beta$ based on a correctly specified model for $h(X_i)$ is always at least as efficient as any estimator of $\beta$ based on models for $E[S \mid X_i]$, we need to review some results from the theory of semiparametric efficiency bounds derived by Chamberlain (1987; and Discussion Paper 1494, Harvard Institute of Economic Research, 1990) and exposited by Newey (1990). For the moment suppose again that, as in equation (1), we were able to correctly specify a parametric model, say, $q(X_i; \theta)$ for $h(X_i)$ depending on a parameter vector $\theta$. In equation (1), $\theta = (\beta_1, \ldots, \beta_K)$. Chamberlain (1987) showed that the estimator of $\beta$ obtained by fitting the model $Y_i = \beta S_i + q(X_i; \theta) + \varepsilon_i$ by unweighted, possibly nonlinear, least squares is the most efficient possible estimator of $\beta$ that is guaranteed to be asymptotically normal and unbiased under the sole prior restrictions that $E[\varepsilon_i \mid S_i, X_i] = 0$ and $h(X_i) = q(X_i; \theta)$. [If, as in equation (1), $q(X_i; \theta)$ is linear in $\theta$, we fit using ordinary least squares. Otherwise, we fit using nonlinear least squares.] Therefore if, as in model (2), we are unwilling to specify a parametric form for $h(X_i)$ and yet want our estimator of $\beta$ to be asymptotically normal and unbiased whatever be $h(X_i)$, the asymptotic variance of any such estimator clearly cannot be less than the supremum of the asymptotic variances of the least squares estimators of $\beta$ taken over the set of all possible parametric models for $h(X_i)$. This supremum is called the semiparametric efficiency bound for an estimator of $\beta$ under model (2) (Bickel et al., 1992) and was shown by Chamberlain (discussion paper cited previously) to equal $n^{-1}\sigma^2 / E[\mathrm{var}(S \mid X)]$, where $n$ is the sample size.

Thus, if we are able to correctly specify a parametric model for $h(X_i)$, the least squares estimator of $\beta$ always has variance no greater than the efficiency bound $n^{-1}\sigma^2 / E[\mathrm{var}(S \mid X)]$. In contrast, if under model (2), we are unable to specify a parametric model for $h(X_i)$, but instead correctly specify a model for $E[S \mid X_i]$, no estimator that is asymptotically unbiased for $\beta$ for all $h(X_i)$ can have variance less than the bound $n^{-1}\sigma^2 / E[\mathrm{var}(S \mid X)]$. This is a consequence of the fact that $\{(S_i, X_i);\ i \in (1, \ldots, n)\}$ is ancillary for $\beta$ under model (2) (Cox and Hinkley, 1974) and, as discussed by Newey (1990), knowledge concerning the marginal distribution of an ancillary statistic does not affect the semiparametric efficiency bound for the estimation of $\beta$.

It needs to be stressed that, even when we can obtain a consistent estimator of $\beta$ in model (2), it does not follow that the parameter $\beta$ can be interpreted as the causal effect of current cigarette smoking on FEV1. We now describe conditions under which $\beta$ does have a causal interpretation.


1.2 [...]

Following Rubin ([...] 1978), let $Y_{S=1,i}$ be subject $i$'s FEV1 had subject $i$ been a current smoker. If subject $i$ is a current smoker in the actual study, then $Y_{S=1,i}$ equals his observed FEV1 $Y_i$. If subject $i$ is not a current smoker, $Y_{S=1,i}$ is missing. Similarly, $Y_{S=0,i}$ is subject $i$'s FEV1 if subject $i$ were, possibly contrary to fact, a current nonsmoker. Rubin defined the average causal effect of current smoking among subjects with observed covariates $X_i$ to be $E[Y_{S=1} \mid X_i] - E[Y_{S=0} \mid X_i]$. Now, under our model (2), we know that $E[Y \mid X_i, S = 1] - E[Y \mid X_i, S = 0] = \beta$ since $E[Y \mid X_i, S = 1] = \beta + h(X_i)$ and $E[Y \mid X_i, S = 0] = h(X_i)$.

Thus a sufficient condition for $\beta$ to equal the average causal effect of current smoking at each level of $X_i$ is that, for each $X_i$,

$$E[Y_{S=s} \mid X_i] = E[Y \mid X_i, S = s], \qquad s \in \{0, 1\}. \tag{4a}$$

Under Rubin's causal model, equation (4a) is equivalent to

$$E[Y_{S=s} \mid X_i] = E[Y_{S=s} \mid X_i, S = s]. \tag{4b}$$

We shall assume that equation (4b) holds and thus $\beta$ has a causal interpretation when $X_i$ is the vector of 22 potential confounding variables described above. The assumption that equation (4b) holds is nonidentifiable in the sense that it is compatible with any joint distribution for the observable random variables $(S_i, X_i, Y_i)$. When equation (4b) holds, we shall call model (2) a semiparametric causal regression model. Equation (4b) says that, conditional on the joint level of the 22 potential independent risk factors $X_i$, the mean of $Y_{S=1}$ among subjects who actually receive treatment $S = 1$ equals that among subjects who actually receive treatment $S = 0$. We do not assume that equation (4b) holds when $X_i$ is a proper subset of the 22 potential confounding variables.

The mathematical results in this paper are concerned only with the estimation of $\beta$ in model (2) and do not depend on whether equation (4b) holds. Of course, in general, we are interested in the estimation of $\beta$ only when we believe it has a causal interpretation.

1.3 Relationship to the Propensity Score

Rosenbaum and Rubin (1983, 1984, 1985) and Rosenbaum (1984, 1987, 1988) have also considered estimating the causal effect of a dichotomous treatment such as $S_i$ on an outcome $Y_i$ by modelling $p[S = 1 \mid X_i]$ when equation (4) holds. These authors call $p[S = 1 \mid X_i]$ the propensity score. In contrast to their approach, our approach straightforwardly allows the treatment or exposure $S_i$ to be continuous or ordinal rather than simply dichotomous. Furthermore, as discussed in the Appendix, our approach allows $S_i$ to be multivariate so that we can, say, estimate the independent effects of current cigarette smoking and past cigarette smoking. In addition, our "regression" approach does not require subclassification or matching on the propensity score $p[S = 1 \mid X_i]$ even when $X_i$ has continuous components so that the potential for "residual" confounding, i.e., bias, due to the fact that one has not precisely matched on $p[S = 1 \mid X_i]$ is avoided. The additional power of our approach derives from the fact that we assume the causal effect of exposure can be described by the parametric component of a semiparametric causal regression model such as model (2).

Rosenbaum ([...]) also considered specifying causal models to avoid the need to match or subclassify on the propensity score. In general, Rosenbaum is concerned with small-sample (exact) rather than large-sample (asymptotic) inference. As a consequence, his causal models [...] to be even more restrictive than model (2). Specifically, he assumes a constant treatment effect model, that is, $Y_{S=1,i} = \beta + Y_{S=0,i}$ for all subjects $i$, although his results would [...] under the weaker assumption that the distributions of $Y_{S=1,i}$ and $Y_{S=0,i}$ differed by [...] parameter $\beta$. Furthermore, as he points out, his "exact" methods do not allow one [...] for the confounding effects of continuous covariates.

Finally, as discussed in Section 2, our approach allows a rather general formalization of [...] to use the "estimated" propensity score than the "true" propensity score even when [...] score is known (Rosenbaum, 1987).

2. Estimators Based on Models for the Conditional Expectation of Exposure Given Confounders

2.1 An Infeasible Estimator

In this section, [...] estimators of $\beta$ under model (2) when we can specify accurate models for $E(S \mid X_i)$. Note that when $S_i$ is dichotomous, models for $E(S \mid X_i)$ are models for $p(S = 1 \mid X_i)$. [...] for pedagogic purposes, we shall assume that we know $E(S \mid X_i)$ exactly. That is, [...] exact prior knowledge of the expected value of $S$ for every combination of [...] confounders $X_i$. Subsequently we make the more tenable assumption that we know $E(S \mid X)$ up to a finite vector of unknown parameters. We allow $\sigma^2(S, X)$ to depend on $(S, X)$. Henceforth, we adopt the following notational convention: $\beta$ will refer to the true but unknown value of the coefficient of $S_i$ in model (2); $\beta^*$ will refer to any hypothesized, possibly incorrect, value for $\beta$.

The estimator we shall consider, which we call the E-estimator,

$$\hat\beta_E = \frac{\sum_{i=1}^{n} Y_i\,[S_i - E(S \mid X_i)]}{\sum_{i=1}^{n} S_i\,[S_i - E(S \mid X_i)]}, \tag{5}$$

is based on a suggestion by Newey (1990).

It is shown in Theorem A.1 in the Appendix that $\hat\beta_E$ has a limiting normal distribution with mean $\beta$.

The consistency of $\hat\beta_E$ is based on the fact that model (2) implies that

$$E[Z_i \mid X_i, S_i] = E[Z_i \mid X_i], \tag{6}$$

where $Z_i = Y_i - S_i\beta$. In the proof of Theorem A.1 in the Appendix, it is shown that equation (6) implies the identity

$$E[U(\beta)] = 0, \tag{7}$$

where, for any $\beta^*$, $U(\beta^*) = \sum_{i=1}^{n} (Y_i - \beta^* S_i)(S_i - E(S \mid X_i))$. The E-estimator $\hat\beta_E$ is the solution to the unbiased estimating equation $U(\beta^*) = 0$.
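The step from equation (6) to equation (7) is deferred to the Appendix. A short sketch of why $E[U(\beta)] = 0$ under model (2) follows, using only the restriction $E[\varepsilon_i \mid S_i, X_i] = 0$; the shorthand $e(X_i) = E(S \mid X_i)$ is introduced here purely for illustration.

```latex
\begin{align*}
E\big[(Y_i - \beta S_i)\{S_i - e(X_i)\}\big]
 &= E\big[\{h(X_i) + \varepsilon_i\}\{S_i - e(X_i)\}\big] \\
 &= E\big[h(X_i)\, E\{S_i - e(X_i) \mid X_i\}\big]
  + E\big[\{S_i - e(X_i)\}\, E(\varepsilon_i \mid S_i, X_i)\big] \\
 &= 0 + 0 .
\end{align*}
```

Summing over $i$ gives $E[U(\beta)] = 0$. Because $U(\beta^*)$ is linear in $\beta^*$, solving $U(\beta^*) = 0$ returns the ratio in equation (5).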

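2.2 A Feasible Estimator

Of course, the estimator $\hat\beta_E$ is not feasible since, in practice, $E(S \mid X_i)$ is unknown. We can overcome this difficulty if we assume a priori that the logistic regression model equation (3) holds. We then estimate $E(S \mid X_i)$ by logistic regression and subsequently estimate $\beta$ by

$$\hat\beta_{\hat E} = \frac{\sum_{i=1}^{n} Y_i\,[S_i - \hat E(S \mid X_i)]}{\sum_{i=1}^{n} S_i\,[S_i - \hat E(S \mid X_i)]}, \tag{8}$$

where $\hat E(S \mid X_i)$ is the fitted value $\hat p_i \equiv p[S = 1 \mid X_i; \hat\alpha]$ of $p(S = 1 \mid X_i)$, and $\hat\alpha$ is the maximum likelihood estimator [...] the logistic regression. Note that we use the symbol $\hat\beta_{\hat E}$ rather than $\hat\beta_E$ to represent the feasible estimator of equation (8).

As shown in Theorem [...] in the Appendix, it follows from Pierce (1982) and Newey (1990) that when the [...] model of equation (3) is true, $\hat\beta_{\hat E}$ is asymptotically normal and unbiased and its asymptotic covariance matrix can be consistently estimated by

$$\widehat{\mathrm{var}}(\hat\beta_{\hat E}) = \widehat{\mathrm{var}}(\hat\beta_E) - Q\,[\widehat{\mathrm{var}}(\hat\alpha)]\,Q^{\mathsf T}, \tag{9}$$

where

$$\widehat{\mathrm{var}}(\hat\beta_E) = \frac{\sum_{i=1}^{n} \hat Z_i^2\,(S_i - \hat p_i)^2}{\left[\sum_{i=1}^{n} S_i\,(S_i - \hat p_i)\right]^2}, \tag{10}$$

$\hat Z_i = Y_i - \hat\beta_{\hat E} S_i$, $Q$ is a [...] vector with $j$th component

$$Q_j = \frac{\sum_{i=1}^{n} \hat Z_i\,\hat p_i (1 - \hat p_i)\,X_{ji}}{\sum_{i=1}^{n} S_i\,(S_i - \hat p_i)}$$

(where we define $X_{ji} = 1$ when $j = 1$), and $\widehat{\mathrm{var}}(\hat\alpha)$ is the estimated covariance matrix (i.e., the inverse of the observed information matrix) from the fit of the logistic model equation (3). The observed information matrix has $(j, k)$ entry $\sum_{i=1}^{n} \hat p_i (1 - \hat p_i) X_{ji} X_{ki}$. The estimator $\widehat{\mathrm{var}}(\hat\beta_{\hat E})$ is not guaranteed to be positive-definite. A positive-definite consistent variance estimator is obtained by replacing $\hat p_i (1 - \hat p_i)$ by $(S_i - \hat p_i)^2$ both in the numerator of $Q_j$ and in the observed information matrix.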
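A minimal sketch of the feasible estimator of equation (8) follows, assuming a single confounder so that the logistic model (3) has only an intercept and one slope. The data are simulated, and every name and numerical value below (`fit_logistic`, `e_estimator`, the true coefficient of -0.5, the choice $h(x) = \sin 2x$) is hypothetical, chosen only to show that the ratio recovers the coefficient of $S$ without modelling $h$.

```python
import math
import random


def fit_logistic(x, s, iterations=25):
    """Newton-Raphson fit of p(S=1|x) = 1/(1+exp(-(a0 + a1*x))); returns (a0, a1)."""
    a0, a1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, si in zip(x, s):
            p = 1.0 / (1.0 + math.exp(-(a0 + a1 * xi)))
            w = p * (1.0 - p)
            g0 += si - p            # score for the intercept
            g1 += (si - p) * xi     # score for the slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a0 += (h11 * g0 - h01 * g1) / det
        a1 += (h00 * g1 - h01 * g0) / det
    return a0, a1


def e_estimator(y, s, p):
    """Ratio sum(Y*(S - p)) / sum(S*(S - p)), as in equations (5) and (8)."""
    numerator = sum(yi * (si - pi) for yi, si, pi in zip(y, s, p))
    denominator = sum(si * (si - pi) for si, pi in zip(s, p))
    return numerator / denominator


# Simulated data (all values hypothetical): the exposure follows a logistic
# model in x, and the outcome depends on x through a nonlinear h(x).
random.seed(1)
n = 5000
true_beta = -0.5
x = [random.uniform(-2.0, 2.0) for _ in range(n)]
p_true = [1.0 / (1.0 + math.exp(-(-0.3 + 0.8 * xi))) for xi in x]
s = [1 if random.random() < pi else 0 for pi in p_true]
y = [true_beta * si + math.sin(2.0 * xi) + random.gauss(0.0, 0.5)
     for si, xi in zip(s, x)]

a0, a1 = fit_logistic(x, s)
p_hat = [1.0 / (1.0 + math.exp(-(a0 + a1 * xi))) for xi in x]
beta_feasible = e_estimator(y, s, p_hat)     # uses the estimated propensity score
beta_infeasible = e_estimator(y, s, p_true)  # uses the true propensity score
```

Both versions should land near the simulated coefficient; the sketch does not compute the variance estimators of equations (9) and (10).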

Even though $\hat\beta_E$ is infeasible when $p[S = 1 \mid X_i]$ is unknown, $\widehat{\mathrm{var}}(\hat\beta_E)$ is still a feasible consistent estimator of its asymptotic variance. Therefore it follows from equation (9) that one generates a more precise estimate of $\beta$ by estimating the propensity score $E(S \mid X_i)$ than by using the true population value of the propensity score even were the latter known. That is, $\widehat{\mathrm{var}}(\hat\beta_{\hat E})$ is always less than or equal to $\widehat{\mathrm{var}}(\hat\beta_E)$. As discussed in the Appendix, this result depends on the fact that $\hat\alpha$ is an efficient estimator of $\alpha$. The preference for $\hat\beta_{\hat E}$ compared to $\hat\beta_E$ when the parameter $\alpha$ of model (3) is known can also be viewed in terms of conditional bias. Specifically, it can be shown that, conditional on the ancillary statistic $[\mathrm{var}(\hat\alpha)]^{-1/2}(\hat\alpha - \alpha)$, $\hat\beta_E$ becomes asymptotically biased while $\hat\beta_{\hat E}$ remains asymptotically unbiased (Robins and Morgenstern, 1987; Rosenbaum, 1987; Efron and Hinkley, 1978). In Table 2 we present four different estimates $\hat\beta_{\hat E}$ of $\beta$ based on specifying four different

Table 2

Estimates $\hat\beta_{\hat E}$ under four different specifications for $p[S = 1 \mid X]$

Analysis   Covariates included in logistic model      $\hat\beta_{\hat E}$   $\widehat{\mathrm{var}}(\hat\beta_{\hat E})$   $\widehat{\mathrm{var}}(\hat\beta_E)$
           of equation (3) for $p[S = 1 \mid X]$                             $\times 10^{-4}$                             $\times 10^{-4}$
(1)        Constant term only                         -.0580                 9.49                                         159.0
(2)        Constant, chronic cough (Yes, No)          .0429                  9.36                                         167.0
(3)        Constant, pack-years of smoking            .0520                  7.45                                         157.9
(4)        Constant, 22 covariates in Table 1         -.1133                 8.82                                         332.6



logistic regression models for $p[S = 1 \mid X]$. In the first analysis in Table 2, we assume no confounding. That is, we fit only a constant term $\alpha_1$ in equation (3). In the second analysis $X_i$ in equation (3) is the single binary covariate, history of chronic cough. In the third analysis $X_i$ is the single continuous covariate, lifetime number of pack-years. In the fourth analysis $X_i$ in equation (3) is the 22-vector of potential confounders. The striking efficiency advantage attributable to estimating the propensity score $p[S = 1 \mid X_i]$ can be obtained by comparing $\widehat{\mathrm{var}}(\hat\beta_{\hat E})$ [...] $\widehat{\mathrm{var}}(\hat\beta_E)$ in Table 2.

Under the assumptions that (a) the coefficient $\beta$ in equation (2) has a causal interpretation, i.e., equation (4b) [...] when $X_i$ is the 22-vector of confounders and (b) the model for $p[S = 1 \mid X_i]$ used in analysis (4) is true, analysis (4) provides a consistent estimator of this causal $\beta$. [...] estimate that current smoking causes a decrease of .1133 liter in FEV1. A 95% confidence interval for $\beta$ is $-.113 \pm (1.96)(.00088)^{1/2} = (-.170, -.056)$. Under assumptions (a) and (b), we now provide sufficient conditions for the simpler analyses (1)-(3) also to provide consistent estimators of the "causal" $\beta$ associated with model (2) with $X_i$ the 22-vector of covariates.

We shall restrict attention to analysis (3) since the conditions for analyses (1) and (2) are similar. Let $X_{k^*}$ be the covariate "lifetime number of pack-years" used in analysis (3). $\hat\beta_{\hat E}$ from analysis (3) will be consistent for the causal $\beta$ if either of the following is true:
Sufficient condition (1): With $X_i$ the 22-vector of covariates, $\alpha_k = 0$ in the logistic model for the 21 covariates other than $X_{k^*}$ (i.e., lifetime pack-years is the only predictor of current smoking among the 22 potential confounding factors).

Sufficient condition (2): The unknown function $h(X_i) = h(X_{1i}, \ldots, X_{Ki})$, $K = 22$, is actually only a function of $X_{k^*i}$ (i.e., lifetime pack-years is the only independent risk factor among the 22 potential confounding factors) and $p[S = 1 \mid X_{k^*i}]$ follows a linear logistic [...]

In general it would be unlikely that an investigator would be willing to assume that either of the above sufficient conditions held, and thus would tend to rely on analysis (4).

Suppose equation (4b) holds and consider the test of the null hypothesis $\beta = 0$ that [...] if $\hat\beta_{\hat E} \pm 1.96\,[\widehat{\mathrm{var}}(\hat\beta_{\hat E})]^{1/2}$ fails to include 0. Then except for the assumption that the model (3) for $p[S = 1 \mid X]$ is correctly specified, this test is an "otherwise asymptotically distribution-free" .05 $\alpha$-level test of the sharp null hypothesis of no causal effect of exposure, i.e., of the hypothesis $Y_{S=1,i} = Y_{S=0,i} = Y_i$ for all subjects $i$.

Rosenbaum (1984, §4.2) proposes a test of this null hypothesis that will be “otherwise 
asymptotically disLributioit-frec” under the condition that OT-u, Yt-o.i) and S, arc con¬ 
ditionally independent given X t . 
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3. Relationship of K-F-stimators to Ordinary Least Squares 

The ordinary least squares (OLS) estimator or fi in equation (1) can he written 

t YAS, - HS\X,)) 


&OLS 


X [it - WIAT)) J ' 


do 


where summation signs without indexes will refer to sums over individuals and whore 
/(SI AT) is the fitted value from the OLS regression of S'on X, and the constant one. Now 
the right-hand side of equation (11) can be written as 

2 YAS, - Hs\x,)) 

^ 0LS “ X SAS, - /(S|Af ( )} 

using the fact that, for OLS, the empirical correlation of the fitted values and the residuals is zero. Now suppose we had modelled $E(S \mid X_i) = p[S = 1 \mid X_i]$ by the linear probability model $p[S = 1 \mid X_i] = \alpha_1 + \sum_{k=2}^{K} \alpha_k X_{ki}$ rather than by a logistic model, and we fit the linear probability model by least squares. Then $\hat E(S \mid X_i) = \hat L(S \mid X_i)$. Therefore, from its definition, $\hat\beta_E = \hat\beta_{OLS}$. In the previous section we showed that $\hat\beta_E$ is consistent if our model for $E(S \mid X_i)$ is true. It follows, as pointed out by Newey (1990), that if, in truth, $E(S \mid X_i) = \alpha_1 + \sum_{k=2}^{K} \alpha_k X_{ki}$ [i.e., $E(S \mid X_i)$ is linear in $X_i$], then $\hat\beta_{OLS}$ is consistent for $\beta$ even if $h(X_i)$ is nonlinear and thus equation (1) is false. Nonetheless, if $h(X_i)$ is nonlinear, the estimate of [the variance of] $\hat\beta_{OLS}$ provided by standard software packages is inconsistent, and equation […] must be used. If $E(S \mid X_i)$ is not linear in $X_i$, the ordinary least squares estimate of $\beta$ will, in general, be inconsistent if the unknown function $h(X_i)$ is, in truth, nonlinear in $X_i$.
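The identity between the E-estimator and OLS under a least-squares linear probability model can be checked numerically. The sketch below uses simulated data with a single covariate; the data-generating values, the helper names `solve` and `ols`, and the choice $h(X) = X^2$ are illustrative assumptions, not part of the paper.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(columns, y):
    """Least squares coefficients of y on the given regressor columns."""
    k = len(columns)
    xtx = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(u * v for u, v in zip(columns[i], y)) for i in range(k)]
    return solve(xtx, xty)

random.seed(0)
n = 500
x = [random.uniform(-1.0, 1.0) for _ in range(n)]
s = [1.0 if random.random() < 0.5 + 0.3 * xi else 0.0 for xi in x]
y = [xi ** 2 + 2.0 * si + random.gauss(0.0, 1.0) for xi, si in zip(x, s)]
ones = [1.0] * n

# Linear probability model for exposure, fit by least squares.
a1, a2 = ols([ones, x], s)
resid = [si - (a1 + a2 * xi) for xi, si in zip(x, s)]

# E-estimator with the least-squares fitted values in place of E(S | X).
beta_e = sum(yi * ri for yi, ri in zip(y, resid)) / sum(si * ri for si, ri in zip(s, resid))
# Coefficient of S in the OLS regression of Y on (1, X, S).
beta_ols = ols([ones, x, s], y)[2]
print(beta_e, beta_ols)
```

The two printed values agree up to floating-point error, because the least-squares residuals of S on X are orthogonal to the fitted values.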

[Table 3 gives] $\hat\beta_{OLS}$ from the fit of equation (1) for the four choices of $X_i$ as in Table 2. [$\hat\beta_{OLS}$ equals $\hat\beta_E$ in] analysis (2). This reflects the fact that when $X_i$ is a single dichotomous [variable, $E(S \mid X_i)$ is] simultaneously linear and linear logistic, and $\hat E(S \mid X_i) = \hat L(S \mid X_i)$. For similar reason[s, $\hat\beta_{OLS}$ equals] $\hat\beta_E$ in analysis (1). $\hat\beta_E$ from analysis (3) using the continuous variable "pack-years" [is not id]entical to the OLS estimate since $\hat E(S \mid X_i) \neq \hat L(S \mid X_i)$. The fact that $\hat\beta_E$ and $\hat\beta_{OLS}$ are […] can be explained by the near linearity of $E(S \mid X_i)$ in our data, which can be checked [by] plotting $\hat E(S \mid X_i)$ versus $X_i$.

We now discuss a modification of the estimator $\hat\beta_E$ that has an even closer connection to OLS than does $\hat\beta_E$. Define

$$\hat\beta_{Em} = \frac{\sum Y_i\,(S_i - \hat E(S \mid X_i))}{\sum [S_i - \hat E(S \mid X_i)]^2}. \qquad (12)$$

When $\hat E(S \mid X_i)$ is nonlinear (e.g., logistic), $\hat\beta_{Em}$ will not in general equal $\hat\beta_E$. Nonetheless,



Table 3
Estimates $\hat\beta_{OLS}$ under four different specifications for covariates included in $\beta_1 + \sum_{k=2}^{K} \beta_k X_{ki}$ in model equation (1)

Analysis   Covariates $X_{ki}$ included in equation (1)   $\hat\beta_{OLS}$   $\widehat{\mathrm{var}}(\hat\beta_{OLS})$* × 10^[…]
(1)        Constant term only                             -.0580              9.50
(2)        Constant, chronic cough (Yes, No)              .0429               9.39
(3)        Constant, pack-years of smoking                .0492               7.64
(4)        Constant, 22 covariates in Table 1             -.1199              8.68

* Using White's (1980) heteroscedastic consistent variance estimator.


$\hat\beta_{Em}$ and $\hat\beta_E$ have the same asymptotic distribution. One obtains $\hat\beta_{Em}$ by regressing $Y_i$ versus "the residuals" $S_i - \hat E(S \mid X_i)$ using OLS regression with no intercept.
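The two estimators can be compared on simulated data. The sketch below takes $\hat\beta_E$ to be the solution of $\sum (Y_i - \beta S_i)(S_i - \hat E(S \mid X_i)) = 0$, which is the form the estimating equations in the text suggest, and computes $\hat\beta_{Em}$ from equation (12). The logistic fit is a hand-written Newton-Raphson routine, and all data-generating values (true effect 2.0, $h(X) = X^2$, the logistic coefficients) are illustrative assumptions.

```python
import math
import random

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(x, s, iterations=25):
    """Newton-Raphson fit of logit p = a + b * x; returns (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, si in zip(x, s):
            p = expit(a + b * xi)
            w = p * (1.0 - p)
            g0 += si - p            # score for the intercept
            g1 += (si - p) * xi     # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

random.seed(1)
n = 5000
beta_true = 2.0
x = [random.uniform(-1.0, 1.0) for _ in range(n)]
s = [1.0 if random.random() < expit(0.3 + 1.5 * xi) else 0.0 for xi in x]
y = [xi ** 2 + beta_true * si + random.gauss(0.0, 1.0) for xi, si in zip(x, s)]

a, b = fit_logistic(x, s)
resid = [si - expit(a + b * xi) for xi, si in zip(x, s)]

num = sum(yi * ri for yi, ri in zip(y, resid))
beta_e = num / sum(si * ri for si, ri in zip(s, resid))   # E-estimator
beta_em = num / sum(ri * ri for ri in resid)              # equation (12): no-intercept OLS on the residuals
print(beta_e, beta_em)
```

Both estimates should land near the simulated effect even though the outcome regression $h(X)$ is never modelled; they differ slightly in finite samples because the logistic residuals are not exactly orthogonal to the fitted probabilities.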




4. Two-Stage E-Estimators 

Throughout this section we assume that the logistic model equation (3) is correctly specified with $X_i$ the vector of 22 covariates. Then $\hat\beta_E$ is a consistent estimator of $\beta$ in equation (2) without making any assumptions about the form of $h(X_i)$. Suppose now that we have an a priori guess as to the shape of $h(X_i)$. For concreteness, suppose we believed that $h(X_i)$ was linear or at least nearly linear in $X_i$, i.e., $h(X_i) = \beta_1 + \sum_{k=2}^{K} \beta_k X_{ki}$. We can now consider how to develop an estimator, say $\hat\beta^*$, that may be much more efficient than $\hat\beta_E$ if our guess concerning the shape of $h(X_i)$ is correct or nearly correct, and will remain consistent, asymptotically normal no matter how wrong our guess may be. To construct $\hat\beta^*$, we proceed in two steps. First we compute $\hat E(S \mid X_i)$ and $\hat\beta_E$ as before. We then regress $\hat\varepsilon_i = Y_i - \hat\beta_E S_i$ on $X_i$. We define $\hat\beta^*$ to be the solution $\beta^{\dagger}$ to the estimating equation

$$0 = \sum_{i=1}^{n} \Big(Y_i - \beta^{\dagger} S_i - \hat\beta_1 - \sum_{k=2}^{K} \hat\beta_k X_{ki}\Big)\big(S_i - \hat E(S \mid X_i)\big),$$

where $(\hat\beta_1, \ldots, \hat\beta_K)$ are the OLS estimates from the regression of $\hat\varepsilon_i$ on $X_i$. Therefore,

$$\hat\beta^* = \frac{\sum \big(Y_i - \hat\beta_1 - \sum_{k=2}^{K} \hat\beta_k X_{ki}\big)\big(S_i - \hat E(S \mid X_i)\big)}{\sum S_i\,\big(S_i - \hat E(S \mid X_i)\big)}.$$
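The two-step construction can be sketched on simulated data as follows. Several choices here are assumptions made for illustration only: a single covariate, a hand-written Newton-Raphson logistic fit, and a working model for $h(X)$ that includes a squared term (the text's concrete example is linear in $X$). The helper names are hypothetical.

```python
import math
import random

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(x, s, iterations=25):
    """Newton-Raphson fit of logit p = a + b * x; returns (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, si in zip(x, s):
            p = expit(a + b * xi)
            w = p * (1.0 - p)
            g0 += si - p
            g1 += (si - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(columns, y):
    """Least squares coefficients of y on the given regressor columns."""
    k = len(columns)
    xtx = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(u * v for u, v in zip(columns[i], y)) for i in range(k)]
    return solve(xtx, xty)

random.seed(2)
n = 3000
beta_true = 2.0
x = [random.uniform(-1.0, 1.0) for _ in range(n)]
s = [1.0 if random.random() < expit(0.3 + 1.5 * xi) else 0.0 for xi in x]
y = [3.0 * xi ** 2 + beta_true * si + random.gauss(0.0, 1.0) for xi, si in zip(x, s)]

# Step 1: fitted exposure probabilities and the one-step estimate beta_E.
a, b = fit_logistic(x, s)
resid_s = [si - expit(a + b * xi) for xi, si in zip(x, s)]
denom = sum(si * ri for si, ri in zip(s, resid_s))
beta_e = sum(yi * ri for yi, ri in zip(y, resid_s)) / denom

# Step 2: regress Y - beta_E * S on the working-model regressors (1, X, X^2).
cols = [[1.0] * n, x, [xi ** 2 for xi in x]]
eps = [yi - beta_e * si for yi, si in zip(y, s)]
coef = ols(cols, eps)
g = [sum(c * col[i] for c, col in zip(coef, cols)) for i in range(n)]

# Solve sum (Y - beta * S - g)(S - p_hat) = 0 for beta.
beta_star = sum((yi - gi) * ri for yi, gi, ri in zip(y, g, resid_s)) / denom
print(beta_e, beta_star)
```

The second-stage estimate remains a solution of an estimating equation weighted by the exposure residuals, which is why a wrong guess for $h(X)$ affects its precision but not its consistency under the assumptions of this section.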

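In Theorems A.1 and A.3 in the Appendix we show that $\hat\beta^*$ is asymptotically normal and unbiased even if the proposed linear model for $h(X_i)$ is incorrect, and that a consistent estimator of $\mathrm{var}^A(\hat\beta^*)$ is given by equation (13).

[Equation (13) is not recoverable from the source scan. The legible fragments show squared terms in $(S_i - \hat p_i)$, a denominator involving $\sum S_i(S_i - \hat p_i)$, a residual of the form $Y_i - \hat\beta^* S_i - \sum_k \hat\beta_k X_{ki} - \hat\beta_1$, and a correction vector $(\hat Q^*)^T$.]   (13)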


In our example, […] with $\widehat{\mathrm{var}}(\hat\beta^*) = 8.79 \times 10^{[\ldots]}$ when $X_i$ is the 22-vector of covariates.

In the final paragraph of the Appendix we show that, if the linear model postulated for $h(X_i)$ were correct, then (1) $\hat Q^*$ converges to zero in probability so the correction term could be ignored, and (2) if $\sigma^2(S, X) = \sigma^2$, then $\mathrm{var}^A(\hat\beta^*) = n^{-1}\sigma^2 / E[\mathrm{var}(S \mid X)]$.

When $h(X_i)$ is linear, $\hat\beta_{OLS}$ will be consistent asymptotically normal and $\mathrm{var}^A(\hat\beta_{OLS})$ will be less than or equal to $\mathrm{var}^A(\hat\beta^*)$, with equality when $E(S \mid X_i)$ is linear in $X_i$. Of course, if neither $h(X_i)$ nor $E(S \mid X_i)$ is linear, $\hat\beta^*$ but not $\hat\beta_{OLS}$ remains consistent [provided the nonlinear model for $E(S \mid X_i)$ is correctly specified]. When $\sigma^2(S, X) = \sigma^2$ and $h(X_i)$ is, in truth, linear, $\hat\beta^*$ has the smallest asymptotic variance among all estimators that remain asymptotically unbiased even were $h(X_i)$ nonlinear (Chamberlain, discussion paper cited previously). That is, it attains the semiparametric efficiency bound for model (2).


5. Discussion

Suppose again that $\beta$ in model (2) is causal [i.e., equation (4b) holds] when $X_i$ is the 22-vector of covariates. Then the validity of our E-estimators of the causal effect of current smoking on FEV1 requires that the semiparametric regression model (2) and logistic








Biometrics, June 1992 



regression (3) be correctly specified. Specification of (3) can be checked using the techniques described by Landwehr, Pregibon, and Shoemaker (1984). The no-interaction assumption of model (2) can be checked by nesting (2) in the more general semiparametric regression model of the Appendix that includes interactions between current smoking $S_i$ and the covariates in $X_i$, and then testing whether the interaction coefficients are nonzero.

We note that, rather than simply modelling $p[S = 1 \mid X_i]$ by the linear no-interaction logistic model equation (3), we could continue to add to equation (3) additional terms such as powers of the $X_{ki}$ (e.g., $X_{ki}^2$, $X_{ki}^3$) for continuous covariates and all orders of interaction between the various covariates and their powers (e.g., $X_{ki} \cdot X_{k'i} \cdot X_{k''i}^2$). This will greatly increase the number of free coefficients in our model for $p[S = 1 \mid X_i]$. As we add these additional terms, we derive two benefits. First, we decrease any asymptotic bias in $\hat\beta_E$ (or $\hat\beta^*$) due to possible misspecification of the linear no-interaction model for $p[S = 1 \mid X_i]$. Second, when the linear no-interaction logistic model is correctly specified and thus the additional terms are not necessary to make $\hat\beta_E$ unbiased, the asymptotic variance of $\hat\beta_E$ (or $\hat\beta^*$) is nonincreasing and will usually decrease as the number of free parameters in the model for $p[S = 1 \mid X_i]$ increases (see Pierce (1982) and Corollary A.1 of the Appendix). Thus, rather than having the usual tradeoff between efficiency and bias, we find that increasing the number of free parameters can lead to improvements in both bias and efficiency. This apparent "free lunch" must be tempered by two facts. First, no matter how many terms we add, $\mathrm{var}^A(\hat\beta_E)$ and $\mathrm{var}^A(\hat\beta^*)$ will always exceed $n^{-1}\sigma^2 / E(\mathrm{var}(S \mid X))$ (with homoscedastic errors) (Chamberlain, 1987). Second, the results we have derived require that the estimates of the free parameters in the model for $p[S = 1 \mid X_i]$ are $n^{1/2}$-consistent. [Newey (1990) suggests that $n^{1/4}$-consistency is sufficient.] This limits the number of free parameters we may have in our model for $p[S = 1 \mid X_i]$ as a function of sample size. For example, we could not allow the number of free parameters to equal the total sample size. Cross-validation techniques for model selection should be useful in choosing a proper ratio of sample size to parameters. Moderate and small-sample simulation studies are needed as a guide.

We further note that when the linear no-interaction logistic model (3) is misspecified, the asymptotic variance of the (now potentially biased) estimator $\hat\beta_E$ based on a misspecified model for $p[S = 1 \mid X_i]$ can be less than the asymptotic variance of the estimator $\hat\beta_E$ based on a more richly parameterized, correctly specified model in which the misspecified model is nested. This phenomenon is evident in a comparison of analyses (3) and (4) in Table 2. The variance of $\hat\beta_E$ in analysis (3) is less than that in analysis (4), because covariates other than "pack-years of smoking" are also important predictors of current smoking.

The results described in the preceding three paragraphs help to clarify both when E-estimation will and will not be preferable to standard covariance adjustment by least squares. Consider first the case in which the sample size is quite large and the dimension of $X_i$ is small, so that richly parameterized models for either $h(X_i)$ or $p[S = 1 \mid X_i]$ can be used. Then, as discussed above and in technical detail by Newey (1990), as one adds power and interaction terms to the model (3) for $p[S = 1 \mid X_i]$, any bias in $\hat\beta^*$ and $\hat\beta_E$ would tend to zero and the asymptotic variance of $\hat\beta^*$, and even $\hat\beta_E$, will approach the semiparametric efficiency bound of $n^{-1}\sigma^2 / E[\mathrm{var}(S \mid X)]$. Similarly, in this setting, if we expanded the linear regression model (1) by adding additional terms such as powers of $X_{ki}$ and interactions between the $X_{ki}$ and their powers, the bias of $\hat\beta_{OLS}$ from the least squares fit of (1) would tend to zero, and the variance of $\hat\beta_{OLS}$ would approach the efficiency bound $n^{-1}\sigma^2 / E[\mathrm{var}(S \mid X)]$. Thus, in this setting, the use of highly parameterized models for $h(X_i)$ fit by least squares or the use of highly parameterized models for $p[S = 1 \mid X_i]$ fit by E-estimation leads to estimators of $\beta$ with similar properties.

Now, consider the case in which the dimension of $X_i$ is large and/or the sample size is moderate. One is then restricted to choosing parsimonious parametric models for $h(X_i)$ and/or $p[S = 1 \mid X_i]$. Further, since the ratio of the sample size to the dimension of $X_i$ is small, the power to discriminate between correct and incorrectly specified models for $h(X_i)$ and/or $p[S = 1 \mid X_i]$ will be poor. If, as is often the case in an etiologic study, our primary interest is in obtaining valid inferences concerning $\beta$ (e.g., confidence intervals that cover at their nominal rate), it is essential to try to obtain asymptotically unbiased estimators of $\beta$. Since, in general, unbiased estimation of $\beta$ requires that the model used in the analysis be correct, we would prefer E-estimation over least squares estimation if we believed that our ability to specify nearly correct parsimonious models for $p[S = 1 \mid X_i]$ exceeded our ability to specify such models for $h(X_i)$. This would be the case when the investigator thinks, based on substantive considerations, that his or her knowledge of the shape of the regression surface $p[S = 1 \mid X_i]$ is sharper than knowledge of the shape of the function $h(X_i)$. In the special case represented by our example in Section 3, in which the fitted regression surface $\hat p[S = 1 \mid X_i]$ is nearly linear in the $X_i$, E-estimation and standard covariance adjustment by least squares will provide similar estimates irrespective of whether $h(X_i)$ is or is not linear.

We next cons[ider whether] it might be possible to develop robust E-estimators. Even if the linear model (1) were true, the efficiency of $\hat\beta_{OLS}$ would be poor if the errors $\varepsilon_i$ have heavy-tailed distributions (Huber, 1981). If we are willing to assume that, in addition to (1), the errors were independent of the $(S_i, X_i)$, efficient robust estimation based on M, L, or R estimators is possible (Huber, 1981). If $\varepsilon_i$ is independent of $(S_i, X_i)$ but model (1) were not true, […] E-estimation of model (2) could be based on solving an unbiased estimating equation of the form $\sum_i m(Y_i - \beta^{\dagger} S_i, X_i)(S_i - \hat E[S \mid X_i]) = 0$, where the function $m(\cdot, X_i)$ […] be chosen to downweight observations for which $Y_i - \beta^{\dagger} S_i$ differs greatly from its expected value given $X_i$. (Such observations will be associated with large values of the re[…].) How to choose the function $m(Y_i - \beta^{\dagger} S_i, X_i)$ in this setting is outside the scope of this paper.

If $\sigma^2(S, X)$ [depends] on $X$ alone or on $S$ and $X$, it is possible to develop "weighted" E-estimators that are more efficient than the E-estimators $\hat\beta_E$ or $\hat\beta^*$ (Chamberlain, 1987).

Suppose next that the outcome of interest is a dichotomous disease variable. Then $Y_i$ will be a Bernoulli random variable. In that case, one might no longer wish to specify the semiparametric [regression model (2)],

$$E[Y_i \mid X_i, S_i] = h(X_i) + \beta S_i,$$

since the model does not naturally obey the restriction that probabilities must lie in the interval [0, 1]. Therefore one might specify a semiparametric logistic model

$$E[Y_i \mid X_i, S_i] = \frac{\exp[h(X_i) + \beta S_i]}{1 + \exp[h(X_i) + \beta S_i]}. \qquad (14)$$


Unfortunately, the approach developed in this paper will not allow us to consistently estimate the $\beta$ of equation (14) even though Bickel et al. (1992) and Chamberlain (discussion paper cited previously) show that, in principle, there should exist an $n^{1/2}$-consistent estimator of $\beta$ based on data $(X_i, S_i, Y_i)$ [at least when the dimension of $X_i$ is fixed as the sample size increases]. Our approach fails because it is fundamentally based on the fact that, for model (2), $\beta$ is identified from the "pseudo-data" $(S_i, V_i, Y_i)$, where $V_i = E(S \mid X_i)$. We call $V_i$ "pseudo-data." For example, if $V_i$ were known, our estimator $\hat\beta_E$ does not require data on $X_i$. It can be shown that $\beta$ in equation (14) is not identified from pseudo-data $(S_i, V_i, Y_i)$ due to the "noncollapsibility" of the logistic parameter $\beta$ when we collapse from the "raw data" $X_i$ to $V_i$. Indeed, Gail, Wieand, and Piantadosi (1984) essentially prove this nonidentifiability result in the special case for which $V_i = 1/2$ for all subjects. In fact, suppose $X_i$ were dichotomous and thus $e^{\beta}$ was the common exposure ($S$)-disease ($Y$) odds ratio in the two 2 × 2 tables indexed by the levels of $X$. In this special case, the nonidentifiability of $\beta$ when $V_i$ is a fixed constant for all subjects $i$ is simply a restatement of the following well-known fact. Even when $S$ and $X$ are (marginally) independent, the common odds ratio $e^{\beta}$ is not identified from data $(S_i, Y_i)$ since the marginal exposure-disease odds ratio (ignoring $X$) may differ from $e^{\beta}$ and the magnitude of the difference depends on the distribution of $X$ (Gail et al., 1984). However, in contrast to our nonidentifiability results for the $\beta$ of model (14), if equation (4b) holds, the average causal effect of $S$ on disease $Y$, i.e., $E(Y_{s=1}) - E[Y_{s=0}]$, is identified from $(S_i, V_i, Y_i)$ (Rosenbaum and Rubin, 1983).
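The well-known fact about the marginal odds ratio can be illustrated with a small calculation. The stratum intercepts, the value of $\beta$, and the distribution of $X$ below are hypothetical numbers chosen for the illustration; $X$ is taken to be independent of $S$, as in the special case discussed above.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def odds(p):
    return p / (1.0 - p)

beta = 1.0
h = {0: -2.0, 1: 2.0}     # hypothetical values of h(X) in the two strata
pr_x = {0: 0.5, 1: 0.5}   # X independent of S: same weights for S = 0 and S = 1

# Odds ratio within each stratum of X: equal to exp(beta) by construction.
conditional_or = [odds(expit(h[x] + beta)) / odds(expit(h[x])) for x in (0, 1)]

# Marginal risks, averaging over X, and the resulting marginal odds ratio.
risk = {s: sum(pr_x[x] * expit(h[x] + beta * s) for x in (0, 1)) for s in (0, 1)}
marginal_or = odds(risk[1]) / odds(risk[0])

print(conditional_or)  # both entries equal exp(1), about 2.718
print(marginal_or)     # about 1.569, closer to 1 than the common odds ratio
```

Changing `pr_x` or the spread of `h` changes the marginal odds ratio while the stratum-specific odds ratio stays fixed, which is the dependence on the distribution of $X$ noted in the text.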

Suppose next that $Y_i$ has a Poisson or overdispersed Poisson distribution. We might then wish to specify semiparametric log-linear models, e.g.,

$$E[Y_i \mid X_i, S_i] = \exp[h(X_i) + \beta S_i]. \qquad (15)$$

For log-linear models, a simple modification of our approach can be used to consistently estimate $\beta$ from pseudo-data $(S_i, V_i, Y_i)$. Specifically, since, under model (15), $E[U(\beta)] = 0$, where

$$U(\beta^{\dagger}) = \sum_i Y_i\, e^{-\beta^{\dagger} S_i}\,(S_i - E(S \mid X_i)), \qquad (16)$$

the solution $\hat\beta$ to $U(\beta^{\dagger}) = 0$ will be consistent, asymptotically normal. A feasible consistent estimator $\hat\beta_E$ can [be obtained] from data $(S_i, X_i, Y_i)$ by specifying a (correct) model for $E[S \mid X_i]$.

[E-estim]ation can be extended to estimate the causal effect of a time-[varying treatm]ent. Specifically, Robins (1989a, 1992a, 1992b, 1992c, 1992d) and Robins et al. […] use an extension of E-estimation, which they call G-estimation, to estimate, from observational [data], the causal effect of a time-varying treatment both on a survival time outcome and on the evolution of the mean of a continuous outcome variable measured repeatedly over time in the presence of time-dependent confounding factors. Robins (1989a, 1992b, 1992d) uses G-estimation to correct for noncompliance in randomized trials studying the effect of a time-varying treatment both on survival time outcomes and on the evolution of the mean of a continuous outcome variable when noncompliance depends on time-dependent prognostic factors. G-estimation is of particular importance in estimating the causal effect of a time-varying treatment in the presence of time-varying prognostic factors because standard covariance adjustment based on time-dependent Cox proportional hazard models for survival time outcomes or generalized estimating equations (Liang and Zeger, 1986) for repeated measures outcomes cannot consistently estimate the treatment effect (Robins, 1986, 1989a, 1989b, 1992a, 1992b, 1992c).
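Returning to the log-linear model (15), the claim that the estimating function (16) has mean zero at the true $\beta$ can be verified by iterated expectations. The following is a sketch of that step in the notation of the text, with $V_i = E(S \mid X_i)$; nothing beyond model (15) is assumed.

```latex
\begin{aligned}
E\left[ Y_i\, e^{-\beta S_i} (S_i - V_i) \right]
  &= E\left[ e^{-\beta S_i}\, E(Y_i \mid X_i, S_i)\, (S_i - V_i) \right] \\
  &= E\left[ e^{h(X_i)} (S_i - V_i) \right] \\
  &= E\left[ e^{h(X_i)}\, E(S_i - V_i \mid X_i) \right] = 0 .
\end{aligned}
```

The second line uses model (15): multiplying by $e^{-\beta S_i}$ removes the exposure term and leaves a function of $X_i$ alone, which is uncorrelated with $S_i - V_i$ by the definition of $V_i$.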
RÉSUMÉ

To estimate the influence of one or more factors on a variable of interest, one must take into account the effects of covariates that, on the one hand, vary with those factors and, on the other hand, help





Estimating Exposure Effects 







predict the variable of interest independently of those factors. In this article, we present regression methods that, unlike the usual methods, adjust for the confounding effect of several covariates (continuous or discrete) by modelling the conditional expectation of the factors as a function of the covariates. In the special case of a single two-level factor, this conditional expectation is identical to what Rosenbaum and Rubin called the propensity score. These authors also proposed estimation methods based on modelling the propensity score. Our methods generalize those of Rosenbaum and Rubin in several ways. First, our approach extends directly to all possible configurations of the factors, each of which may be continuous, ordinal, or discrete. Second, even in the case of a single two-level factor, our approach requires no stratification or matching on the propensity score, so that the risk of "residual confounding" (that is, bias) associated with those methods is avoided. Finally, our approach supports the idea that it is better to use the estimated propensity score than the true propensity score, even when the true score is known. The additional power of our approach comes from our assumption that the influence of the factors can be described by the parametric component of a semiparametric regression model. As an illustration, we reanalyze, in a cohort of 2,713 adult white males, the effect of smoking on forced expiratory volume in one second, and we compare the results obtained with those of classical methods.

Appendix


We now prove the results stated in the text. We assume that

$$Y_i = f(S_i, X_i, \beta) + h(X_i) + \varepsilon_i, \qquad E(\varepsilon_i \mid S_i, X_i) = 0, \qquad (A.0)$$

where $f(S_i, X_i, \beta)$ is a [linear] function of a $K$-dimensional parameter vector $\beta$ that takes the value zero when $S_i = 0$. [Extension of our results to nonlinear functions of $\beta$ is straightforward.] Model (2) in the text is the special case in which $f(S_i, X_i, \beta) = \beta S_i$ for univariate $\beta$ and dichotomous $S_i$. (A.0) generalizes (2) by allowing for multivariate exposures, each component of which may be categorical, ordinal, or continuous. For example, we might suppose $S_i = (S_{1i}, \ldots, S_{Mi})$ and $f(S_i, X_i, \beta) = \sum_{m=1}^{M} \beta_m S_{mi} + \beta_{M+1} S_{1i} X_{1i}$ with $K = M + 1$. If equation (4b) holds when $s$ is any value of $S_i$, then $f(S_i, X_i, \beta)$ is the average effect of joint exposure level $S_i$ compared to the baseline level $S_i = 0$ among subjects with covariate level $X_i$. If $f(S_i, X_i, \beta)$ depends on $X_i$, we say there is an exposure-covariate interaction.

Define the $K$-vector $f_\beta(S_i, X_i)$ of partial derivatives of $f(S_i, X_i, \beta)$ with respect to the components of $\beta$. [We assume]

$$E[f_\beta(S_i, X_i) \mid X_i] = r(X_i; \alpha), \qquad (A.1)$$

where $r(\cdot\,; \cdot)$ is a known function and $\alpha$ is an unknown parameter. Define $R(S_i, X_i; \alpha) \equiv f_\beta(S_i, X_i) - r(X_i; \alpha)$. Note that $E[R(S_i, X_i; \alpha) \mid X_i] = 0$. If, as in the text, $f_\beta(S_i, X_i) = S_i$ is a Bernoulli random variable, (A.1) is a fully parametric model for $S_i$ given $X_i$; otherwise, (A.1) is a semiparametric model for the density $f(S_i \mid X_i)$, since the distribution of $R(S_i, X_i; \alpha)$ is completely unrestricted except for having mean zero given $X_i$.

Now for any nonrandom function $g(x)$, define

$$n^{-1/2} U(\beta^{\dagger}, g, \hat\alpha) = n^{-1/2} \sum_i \{Y_i - f(S_i, X_i, \beta^{\dagger}) - g(X_i)\}\, R(S_i, X_i; \hat\alpha) = n^{-1/2} \sum_i U_i(\beta^{\dagger}, g, \hat\alpha), \qquad (A.2)$$

where $\hat\alpha$ is asymptotically equivalent to an $n^{1/2}$-consistent solution to $0 = \sum_i M_i(S_i, X_i, \alpha^{\dagger})$ for some $M_i(\alpha^{\dagger})$ satisfying $E[M_i(\alpha)] = 0$. That is, when $M_i(\cdot)$ is continuously differentiable, $n^{1/2}(\hat\alpha - \alpha) = -\{E[\partial M_i(\alpha)/\partial\alpha']\}^{-1} n^{-1/2} \sum_i M_i(\alpha) + o_p(1)$, and we say that $-\{E[\partial M_i(\alpha)/\partial\alpha']\}^{-1} M_i(\alpha)$ is the influence function of $\hat\alpha$. Chamberlain (1987) proves that $\hat\alpha$ is semiparametric efficient for $\alpha$ under the sole restriction (A.1) on the conditional distribution of $S_i$ given $X_i$ only if $M_i(\alpha)$ equals

$$M_i^{eff}(\alpha) = \frac{\partial r(X_i; \alpha)'}{\partial\alpha}\, \{\mathrm{var}[R(S_i, X_i; \alpha) \mid X_i]\}^{-1} R(S_i, X_i; \alpha).$$

That is, $\hat\alpha$ is semiparametric efficient only if it is asymptotically equivalent to the optimal weighted (possibly nonlinear) least squares estimator of $\alpha$. Henceforth, we shall say that $\hat\alpha$ is semiparametric efficient under (A.1) if $\hat\alpha$ has influence function $-\{E[\partial M_i^{eff}(\alpha)/\partial\alpha']\}^{-1} M_i^{eff}(\alpha)$. If $f_\beta(S_i, X_i) = S_i$ is Bernoulli, semiparametric efficiency under (A.1) is just ordinary parametric efficiency. Our main result is given as Theorem A.1.


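Theorem A.1 Under regularity conditions given in Corollary 1, Chapter 8 of Manski (1988), there is a solution $\hat\beta_E(g) \equiv \hat\beta_E(g, \hat\alpha)$ to $n^{-1/2} U(\beta^{\dagger}, g, \hat\alpha) = 0$ such that $n^{1/2}(\hat\beta_E(g) - \beta)$ is asymptotically normal with mean 0 and variance that can be consistently estimated by

$$\hat I^{-1} \hat\Delta(g) (\hat I')^{-1}, \qquad (A.3a)$$

where

$$\hat I = n^{-1} \sum_i f_\beta(S_i, X_i)\, R(S_i, X_i; \hat\alpha)',$$
$$\hat\Delta(g) = n^{-1} \sum_i \hat K_i(g)\, \hat K_i(g)',$$
$$\hat K_i(g) = U_i(\hat\beta_E(g), g, \hat\alpha) - \hat B(g)\, \hat C^{-1} M_i(\hat\alpha),$$
$$\hat B(g) = n^{-1} \sum_i \frac{\partial U_i(\hat\beta_E(g), g, \hat\alpha)}{\partial\alpha'} = n^{-1} \sum_i \{Y_i - f(S_i, X_i, \hat\beta_E(g)) - g(X_i)\}\, \frac{\partial R(S_i, X_i; \hat\alpha)}{\partial\alpha'},$$
$$\hat C = n^{-1} \sum_i \partial M_i(\hat\alpha)/\partial\alpha'.$$

If $\hat\alpha$ is semiparametric efficient under (A.1), the asymptotic variance of $n^{1/2}(\hat\beta_E(g) - \beta)$ can be [consistently estimated by]

$$\hat I^{-1} \hat\Sigma(g) (\hat I')^{-1} - \hat Q(g)\, \hat\Omega\, \hat Q'(g), \qquad (A.3b)$$

where $\hat Q(g) = \hat I^{-1} \hat B(g)$, $\hat\Sigma(g) = n^{-1} \sum_i U_i[\hat\beta_E(g), g, \hat\alpha]\, U_i[\hat\beta_E(g), g, \hat\alpha]'$, and $\hat\Omega$ is a consistent estimator of [the asymptotic variance of $n^{1/2}(\hat\alpha - \alpha)$].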

[Unless $f_\beta(S_i, X_i)$ equ]als a dichotomous $S_i$ (as in the text), $\mathrm{var}[R(S_i, X_i; \alpha) \mid X_i]$ may be an unknown function of [$X_i$]. Hence, if one chooses to estimate $\alpha$ by the unweighted (possibly nonlinear) least squares regression of $f_\beta(S_i, X_i)$ on $X_i$, it is necessary to use formula (A.3a) rather than (A.3b), since $\hat\alpha$ will then be efficient only if the (unknown) variance of $R(S_i, X_i; \alpha)$ does not depend [on $X_i$]. However, if one has a correctly specified model $\mathrm{var}[R(S_i, X_i; \alpha) \mid X_i] = v(X_i; \psi)$, where $v(X_i; \psi)$ is a known function and $\psi$ is an unknown parameter, then it is well known that the estimate $\hat\alpha$ that solves $0 = \sum_i [\partial r(X_i; \alpha)'/\partial\alpha]\, \{v(X_i; \hat\psi)\}^{-1} R(S_i, X_i; \alpha)$ has influence function $-\{E[\partial M_i^{eff}(\alpha)/\partial\alpha']\}^{-1} M_i^{eff}(\alpha)$ [and] (A.3b) can be [used when] $\hat\psi$ is the (possibly nonlinear) multivariate least squares regression estimate of $\psi$ obtained by regressing $R(S_i, X_i; \tilde\alpha) R'(S_i, X_i; \tilde\alpha)$ on $X_i$, where $\tilde\alpha$ is obtained from a preliminary unweighted least squares regression of $f_\beta(S_i, X_i)$ on $X_i$.

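Application of Theorem A.1 Consider equation (9) in the text. In that setting, $g(X_i) = 0$; $f(S_i, X_i, \beta) = \beta S_i$; $f_\beta(S_i, X_i) = S_i$; $R(S_i, X_i; \hat\alpha) = S_i - \hat p_i$, where $\hat p_i = e^{\hat\alpha' X_i}/(1 + e^{\hat\alpha' X_i})$; $\partial r(X_i; \hat\alpha)/\partial\alpha = X_i e^{\hat\alpha' X_i}/(1 + e^{\hat\alpha' X_i})^2$; $Y_i - f(S_i, X_i, \hat\beta_E) - g(X_i) = Y_i - \hat\beta_E S_i \equiv \hat\varepsilon_i$; $U_i(\hat\beta_E, g, \hat\alpha) = \hat\varepsilon_i (S_i - \hat p_i)$; $\hat I = n^{-1} \sum S_i (S_i - \hat p_i)$; $\hat Q(g) = -\sum \hat\varepsilon_i \hat p_i (1 - \hat p_i) X_i' / \sum S_i (S_i - \hat p_i)$. Substituting into equation (A.3b), we obtain equation (9).

The reader can check that substituting in (A.3b) also gives equation (13) if we set $g(X_i) = \hat\beta_1 + \sum_{k=2}^{K} \hat\beta_k X_{ki}$ above. (As we shall see in Theorem A.3 below, the fact that $g(X_i)$ is based on estimates $\hat\beta_k$ does not affect the asymptotic variance (A.3b).)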



Proof of Theorem A.1 For pedagogic purposes we first sketch a proof. We then show how Corollary 1 of Manski's Chapter 8 can be used to formally prove the theorem. Since

$$E[U_i(\beta, g, \alpha)] = 0 \qquad (A.4)$$

by (A.0), we have that, under the regularity conditions discussed below, a Taylor expansion and the weak law of large numbers (WLLN) gives

$$0 = n^{-1/2} U(\hat\beta_E(g), g, \hat\alpha) = n^{-1/2} U(g) + I\,[n^{1/2}(\hat\beta_E(g) - \beta)] + B(g)\,[n^{1/2}(\hat\alpha - \alpha)] + o_p(1),$$

where $n^{-1/2} U(g) \equiv n^{-1/2} U(\beta, g, \alpha)$, $I \equiv E[\partial U_i(\beta, g, \alpha)/\partial\beta']$, which does not depend on $g$, and $B(g) \equiv E[\partial U_i(\beta, g, \alpha)/\partial\alpha']$. Hence

$$n^{1/2}(\hat\beta_E(g) - \beta) = -I^{-1}\,[B(g)\, n^{1/2}(\hat\alpha - \alpha) + n^{-1/2} U(g)] + o_p(1). \qquad (A.5)$$

By assumption, $n^{1/2}(\hat\alpha - \alpha) = -C^{-1} n^{-1/2} \sum_i M_i + o_p(1)$, where $C \equiv E[\partial M_i(\alpha)/\partial\alpha']$, $M_i \equiv M_i(\alpha)$. Hence $n^{1/2}(\hat\beta_E(g) - \beta) = -I^{-1} n^{-1/2} \sum_i [U_i(g) - B(g) C^{-1} M_i] + o_p(1)$. Thus $n^{1/2}(\hat\beta_E(g) - \beta)$ is asymptotically normal with mean zero and variance $I^{-1} \Delta(g) (I')^{-1}$, where $\Delta(g) = \mathrm{var}[U_i(g) - B(g) C^{-1} M_i]$, since $n^{1/2}(\hat\beta_E(g) - \beta)$ is a sum of independent mean-zero random variables plus a term of $o_p(1)$. Formula (A.3a) follows from the WLLN.

We next establish (A.3b) for a semiparametric efficient $\hat\alpha$ under (A.1) using arguments similar to those in Pierce (1982) and Newey (1990). Let $L_i(\alpha^{\dagger}, \eta^{\dagger}) \equiv f(S_i \mid X_i; \alpha^{\dagger}, \eta^{\dagger})$ be any (regular) parametric submodel with true parameter values $(\alpha, \eta)$ for the density of $S_i$ given $X_i$ consistent with the restriction (A.1). Let $S_{\alpha,i} \equiv \partial \ln L_i(\alpha, \eta)/\partial\alpha^{\dagger}$. Let

$$\Gamma = \{a(S_i, X_i) : a(S_i, X_i) = \partial \ln L_i(\alpha, \eta)/\partial\eta^{\dagger} \text{ for some parametric submodel}\}.$$

Note $E[a(S_i, X_i) \mid X_i] = 0$ and $E[R(S_i, X_i; \alpha)\, a(S_i, X_i)' \mid X_i] = 0$, since the scores $a(S_i, X_i)$ are restricted only by having a conditional mean of zero and by being conditionally uncorrelated with $R(S_i, X_i; \alpha)$. It follows from Chamberlain (1987), Begun et al. (1983), and Newey (1990) that (a) $S_{\alpha,i} - M_i^{eff} \in \Gamma$ and $E[M_i^{eff} a(S_i, X_i)'] = 0$ for all $a(S_i, X_i) \in \Gamma$ and (b) $\mathrm{var}^A[n^{1/2}(\hat\alpha - \alpha)] = \{E[M_i^{eff} (M_i^{eff})']\}^{-1}$. $M_i^{eff}$ is called the efficient score in the semiparametric model (A.1) for the law of $S_i$ given $X_i$.

Now by differentiating the identity $E_{\alpha^{\dagger}, \eta^{\dagger}}[U_i(\beta, g, \alpha^{\dagger})] = 0$ with respect to $\alpha^{\dagger}$ using the chain rule and evaluating at the true values $(\alpha, \eta)$, we obtain $B(g) = -E[U_i(g) S_{\alpha,i}']$, where $E_{\alpha^{\dagger}, \eta^{\dagger}}$ refers to expectation with respect to a density that differs from the truth only in that the law of $S_i$ given $X_i$ is $f(S_i \mid X_i; \alpha^{\dagger}, \eta^{\dagger})$. Similarly differentiating this identity with respect to $\eta^{\dagger}$, we obtain $E[U_i(g)\, a(S_i, X_i)'] = 0$ for all $a(S_i, X_i) \in \Gamma$. Thus, by (a) in the last paragraph, we conclude $B(g) = -E[U_i(g) (M_i^{eff})']$. Similarly, the identity $E_{\alpha^{\dagger}, \eta^{\dagger}}[M_i(\alpha^{\dagger})] = 0$ implies $C = -E[M_i (M_i^{eff})']$. Hence $K_i(g) \equiv U_i(g) - B(g) C^{-1} M_i = U_i(g) - E[U_i(g)(M_i^{eff})']\, \{E[M_i (M_i^{eff})']\}^{-1} M_i$. In the special case in which $M_i = M_i^{eff}$, $K_i(g)$ is the residual from the (population) least squares regression of $U_i(g)$ on $M_i^{eff}$, and a standard calculation gives $\mathrm{var}[K_i(g)] = \mathrm{var}[U_i(g)] - B(g)\, \{E[M_i^{eff} (M_i^{eff})']\}^{-1} B'(g)$. (A.3b) then follows by (b) in the last paragraph and the WLLN.

Theorem A.1 is formally proved by noting that it is an immediate consequence of Corollary 1 in Manski's Chapter 8 and the above variance calculations when we set Manski's function $g(z, b)$ equal to $(U_i(\beta^{\dagger}, g, \alpha^{\dagger})', M_i(\alpha^{\dagger})')'$ and Manski's function $r(x)$ equal to $x'x$, where $x$ is a vector.

Corollary A.1 If $\hat\alpha^{(j)}$ is semiparametric efficient under the $j$th of $J$ nested correctly specified models $E[f_\beta(S_i, X_i) \mid X_i] = r(X_i; \alpha^{(j)})$, $j = 1, \ldots, J$, with the dimension of $\alpha^{(j)}$ increasing with $j$, then the asymptotic variance of $\hat\beta_E^{(j)}(g) \equiv \hat\beta_E(g, \hat\alpha^{(j)})$ is nonincreasing with $j$.

Proof Correct specification implies that, for $j > j^*$, $M_i^{eff,(j^*)}$ is the first $j^*$ components of $M_i^{eff,(j)}$, the efficient score for the $j$th model. But, by standard least squares theory, the variance of the residual $K_i^{(j)}(g)$ based on the $j$th model must be less than or equal to that based on model $j^*$.

The following theorem will be used in proving the claims made in the paragraph following equation (13).

Theorem A.2

(a) $\mathrm{var}^A[n^{1/2}(\hat\beta_E(g) - \beta)] \geq \mathrm{var}^A[n^{1/2}(\hat\beta_E(h) - \beta)]$.

(b) $\mathrm{var}^A[n^{1/2}(\hat\beta_E(h) - \beta)] = \mathrm{var}^A[n^{1/2}(\tilde\beta_E(h) - \beta)]$, where $\tilde\beta_E(h) \equiv \hat\beta_E(h, \alpha)$ and $\hat\beta_E(h) \equiv \hat\beta_E(h, \hat\alpha)$.

Proof of (a) (a) is an immediate consequence of the following two lemmas.

Lemma A.1 The function $g$ minimizing $\mathrm{var}^A[n^{1/2}(\hat\beta_E(g) - \beta)]$ also minimizes $\mathrm{var}^A[n^{-1/2} U(\beta, g, \hat\alpha)]$.

Proof By a Taylor expansion, we have

$$0 = n^{-1/2} U(\hat\beta_E(g), g, \hat\alpha) = n^{-1/2} U(\beta, g, \hat\alpha) + [n^{-1} \partial U(\beta, g, \hat\alpha)/\partial\beta'][n^{1/2}(\hat\beta_E(g) - \beta)] + o_p(1). \qquad (A.6)$$

A further Taylor expansion of $n^{-1} \partial U(\beta, g, \hat\alpha)/\partial\beta'$ around $\alpha$ and the WLLN proves $n^{-1} \partial U(\beta, g, \hat\alpha)/\partial\beta' = I + o_p(1)$, proving the lemma.

Lemma A.2 The function $h$ minimizes $\mathrm{var}^A[n^{-1/2} U(\beta, g, \hat\alpha)]$.

Proof $n^{-1/2} U(\beta, g, \hat\alpha) = n^{-1/2} \sum_i \varepsilon_i R(S_i, X_i; \hat\alpha) + n^{-1/2} \sum_i \{h(X_i) - g(X_i)\} R(S_i, X_i; \hat\alpha) \equiv A_1 + A_2(g)$, say, where we have used (A.0) to substitute $\varepsilon_i + h(X_i)$ for $Y_i - f(S_i, X_i, \beta)$. If we can show $\mathrm{cov}^A(A_1, A_2(g)) = 0$, then $\mathrm{var}^A[n^{-1/2} U(\beta, g, \hat\alpha)] = \mathrm{var}^A(A_1) + \mathrm{var}^A[A_2(g)]$, which is minimized at $g = h$ since $\mathrm{var}^A[A_2(h)] = 0$. Now $A_1$ and $A_2(g)$ have zero covariance since (a) $E[A_1 \mid (S, X)] = 0$ and (b) $A_2(g)$ is fixed given $(S, X) \equiv \{(S_i, X_i);\ i = 1, \ldots, n\}$. (a) and (b) follow from the fact that $E[\varepsilon_i \mid (S, X)] = 0$ and $\hat\alpha$ depends on the data only through $(S, X)$.


Proof of (b) (b) follows from the fact that $B(h) = 0$ by (A.0).

In general, we do not know $h(X_i)$. Therefore, as in Section 4, we shall hypothesize a model $h(X_i) = g(X_i, \theta)$, where $g(\cdot, \cdot)$ is a known function and $\theta$ is a vector of parameters to be estimated. We estimate $\theta$ by (possibly nonlinear) least squares regression of $Y_i - f(S_i, X_i, \hat\beta_E)$ on $X_i$, where $\hat\beta_E$ is […]. Let $\hat\theta$ be the (possibly nonlinear) least squares estimator of $\theta$. It is clear that, since $\hat\beta_E$ is an $n^{1/2}$-consistent estimator of $\beta$, if the model for $h(X_i)$ were correctly specified, $n^{1/2}(\hat\theta - \theta)$ would have a nondegenerate limiting distribution with mean 0. If the model for $h(X_i)$ were misspecified, there exists $\theta^*$ such that $n^{1/2}(\hat\theta - \theta^*)$ has a nondegenerate limiting distribution with mean 0. The following theorem shows that we can then use $\hat\theta$ to construct an adaptive estimator of $\beta$ that (1) has the same limiting distribution as $\hat\beta_E(h)$ if our model for $h(X_i)$ is correctly specified and (2) remains […] asymptotically normal even if our model is misspecified.

Theorem A.3 If $n^{1/2}(\hat\theta - \theta^*)$ has a nondegenerate limiting distribution with mean 0, then $\hat\beta_E[g(X_i, \hat\theta)]$ has the same limiting distribution as $\hat\beta_E[g(X_i, \theta^*)]$. In particular, it will be consistent and asymptotically normal whether or not the hypothesized model for $h(X_i)$ is correct, and it will have the same limiting distribution as $\hat\beta_E(h)$ if the model for $h(X_i)$ is correct.

Proof For notational convenience, assume that $\theta$ is one-dimensional. It will be sufficient to show that

$$n^{-1/2} U(\beta^{\dagger}, g(\hat\theta), \hat\alpha) = n^{-1/2} U(\beta^{\dagger}, g(\theta^*), \hat\alpha) + o_p(1) \qquad (A.7)$$

for $|\beta^{\dagger} - \beta| = O(n^{-1/2})$. By a Taylor series expansion

$$n^{-1/2} U(\beta^{\dagger}, g(\hat\theta), \hat\alpha) = n^{-1/2} U(\beta^{\dagger}, g(\theta^*), \hat\alpha) + n^{1/2}(\hat\theta - \theta^*)\,[n^{-1} U(\beta^{\dagger}, g'(\theta^*), \hat\alpha)] \qquad (A.8)$$
$$\qquad + \tfrac{1}{2}\, n^{1/2}(\hat\theta - \theta^*)^2\,[n^{-1} U(\beta^{\dagger}, g''(\tilde\theta), \hat\alpha)] \qquad (A.9)$$

for some $\tilde\theta$ between $\hat\theta$ and $\theta^*$. Now, if $\beta^{\dagger} = \beta$, by Theorem A.1 and Pierce (1982), $[n^{-1} U(\beta^{\dagger}, g'(\theta^*), \hat\alpha)]$ converges to 0 in probability since it has mean 0 to $o(n^{-1/2})$ with variance converging to 0 as $n \to \infty$. Further, under regularity conditions, this remains true if $|\beta^{\dagger} - \beta| = O(n^{-1/2})$. It then follows from Slutsky's theorem that expression (A.8) converges in law to 0 and thus in probability to 0. Further, since $[n^{-1} U(\beta^{\dagger}, g''(\tilde\theta), \hat\alpha)]$ is at most $O_p(1)$ and $n^{1/2}(\hat\theta - \theta^*)^2$ is $O_p(n^{-1/2})$, it follows that expression (A.9) is $O_p(n^{-1/2})$. Thus equation (A.7) is true.

Theorem A.3 and part (b) of Theorem A.2 imply proposition (1) in the paragraph following equation (13). Proposition (2) is an easy calculation.
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